Abstract

We propose a randomized second-order method for optimization known as the Newton
Sketch: it is based on performing an approximate Newton step using a randomly
projected or sub-sampled Hessian. For self-concordant functions, we prove that the
algorithm has super-linear convergence with exponentially high probability, with
convergence and complexity guarantees that are independent of condition numbers and
related problem-dependent quantities. Given a suitable initialization, similar
guarantees also hold for strongly convex and smooth objectives without
self-concordance. When implemented using randomized projections based on a
sub-sampled Hadamard basis, the algorithm typically has substantially lower
complexity than Newton's method. We also describe extensions of our methods to
programs involving convex constraints that are equipped with self-concordant
barriers. We discuss and illustrate applications to linear programs, quadratic
programs with convex constraints, logistic regression and other generalized linear
models, as well as semidefinite programs.


1 Introduction 

Relative to first-order methods, second-order methods for convex optimization enjoy superior
convergence in both theory and practice. For instance, Newton's method converges at a
quadratic rate for strongly convex and smooth problems. Moreover, even for weakly
convex functions (i.e., not strongly convex), modifications of Newton's method have super-linear
convergence, compared to the much slower $1/T^2$ convergence rate that can be achieved by a
first-order method like accelerated gradient descent. More importantly, at least
in a uniform sense, the $1/T^2$ rate is known to be unimprovable for first-order methods [17].
Yet another issue in first-order methods is the tuning of the step size, whose optimal choice
depends on the strong convexity parameter and/or smoothness of the underlying problem.
For example, consider the problem of optimizing a function of the form $x \mapsto g(Ax)$, where
$A \in \mathbb{R}^{n \times d}$ is a "data matrix", and $g : \mathbb{R}^n \to \mathbb{R}$ is a
twice-differentiable function. Here the performance of first-order methods will depend on both
the convexity/smoothness of $g$ and the conditioning of the data matrix. In contrast, whenever
the function $g$ is self-concordant, Newton's method with suitably damped steps has a global
complexity guarantee that is provably independent of such problem-dependent parameters.

On the other hand, each step of Newton's method requires solving a linear system defined
by the Hessian matrix. For instance, in application to the problem family just described
involving an $n \times d$ data matrix, each of these steps has complexity scaling as $O(nd^2)$. For
this reason, both forming the Hessian and solving the corresponding linear system pose a
tremendous numerical challenge for large values of $(n, d)$, for instance, values of thousands
to millions, as is common in big data applications. In order to address this issue, a multitude of
different approximations to Newton's method have been proposed and studied in the literature.
Quasi-Newton methods form estimates of the Hessian by successive evaluations of the gradient
vectors and are computationally cheaper. Examples of such methods include the DFP and BFGS
schemes and also their limited-memory versions (see the book [25] for further details). A
disadvantage of such approximations based on first-order information is that the associated
convergence guarantees are typically much weaker than those of Newton's method and require
stronger assumptions. Under restrictions on the eigenvalues of the Hessian (strong convexity
and smoothness), Quasi-Newton methods typically exhibit local super-linear convergence.

In this paper, we propose and analyze a randomized approximation of Newton's method,
known as the Newton Sketch. Instead of explicitly computing the Hessian, the Newton Sketch
method approximates it via a random projection of dimension $m$. When these projections
are carried out using the randomized Hadamard transform, each iteration has complexity
$O(nd \log m + dm^2)$. Our results show that it is always sufficient to choose $m$ proportional
to $\min\{d, n\}$, and moreover, that the sketch dimension $m$ can be much smaller for certain
types of constrained problems. Thus, in the regime $n > d$ and with $m \asymp d$, the complexity
per iteration can be substantially lower than the $O(nd^2)$ complexity of each Newton step.
Specifically, for $n \ge d^2$, the complexity of Newton Sketch per iteration is $O(nd \log d)$, which
is linear in the input size ($nd$) up to a logarithmic factor, and comparable to first-order methods
which only access the derivative $g'(Ax)$. Moreover, we show that for self-concordant functions,
the total complexity of obtaining a $\delta$-optimal solution is $O(nd \log d \log(1/\delta))$, and, unlike
first-order methods, does not depend on constants such as strong convexity or smoothness
parameters. On the other hand, for problems with $d > n$, we also provide a dual strategy which
effectively has the same guarantees with the roles of $d$ and $n$ exchanged.


We also consider other random projection matrices and sub-sampling strategies, including 
partial forms of random projection that exploit known structure in the Hessian. For
self-concordant functions, we provide an affine invariant analysis proving that the convergence
is linear-quadratic and the guarantees are independent of the function and data, such as 
condition numbers of matrices involved in the objective function. Finally, we describe an 
interior point method to deal with arbitrary convex constraints which combines the Newton 
sketch with the barrier method. We provide an upper bound on the total number of iterations 
required to obtain a solution with a pre-specified target accuracy. 


The remainder of this paper is organized as follows. We begin in Section 2 with some
background on the classical form of Newton's method, random matrices for sketching, and
Gaussian widths as a measure of the size of a set. In Section 3, we formally introduce the
Newton Sketch, including both fully and partially sketched versions for unconstrained and
constrained problems. We provide some illustrative examples in Section 3.2 before turning to
local convergence theory in Section 3.3. Section 4 is devoted to global convergence results for
self-concordant functions, in both the constrained and unconstrained settings. In Section 5,
we consider a number of applications and provide additional numerical results. The bulk
of our proofs are given in a later section, with some more technical aspects deferred to the
appendices.


2 Background


We begin with some background material on the standard form of Newton’s method, various 
types of random sketches, and the notion of Gaussian width as a complexity measure. 

2.1 Classical version of Newton’s method 

In this section, we briefly review the convergence properties and complexity of the classical 
form of Newton’s method; see the sources for further background. 

Let $f : \mathbb{R}^d \to \mathbb{R}$ be a closed, convex and twice-differentiable function that is bounded
below. Given a convex set $\mathcal{C}$, we assume that the constrained minimizer

$$x^* := \arg\min_{x \in \mathcal{C}} f(x) \tag{1}$$

is uniquely defined, and we define the minimum and maximum eigenvalues
$\gamma = \lambda_{\min}(\nabla^2 f(x^*))$ and $\beta = \lambda_{\max}(\nabla^2 f(x^*))$ of the Hessian evaluated at the minimum.

We assume moreover that the Hessian map $x \mapsto \nabla^2 f(x)$ is Lipschitz continuous with
modulus $L$, meaning that

$$|||\nabla^2 f(x + \Delta) - \nabla^2 f(x)|||_{\mathrm{op}} \le L \|\Delta\|_2. \tag{2}$$

Under these conditions and given an initial point $x^0 \in \mathcal{C}$ such that $\|x^0 - x^*\|_2 \le \frac{\gamma}{2L}$,
the Newton updates are guaranteed to converge quadratically, viz.

$$\|x^{t+1} - x^*\|_2 \le \frac{2L}{\gamma} \|x^t - x^*\|_2^2.$$


This result is classical: for instance, see Boyd and Vandenberghe [1] for a proof. Newton’s 
method can be slightly modified to be globally convergent by choosing the step sizes via a 
simple backtracking line-search procedure. 

The following result characterizes the complexity of Newton's method when applied to
self-concordant functions and is central in the development of interior point methods.
We defer the definitions of self-concordance and the line-search procedure to the following
sections. The number of iterations needed to obtain a $\delta$-approximate minimizer of a
strictly convex self-concordant function $f$ is bounded by

$$\frac{20 - 8a}{a b (1 - 2a)^2} \big(f(x^0) - f(x^*)\big) + \log_2 \log_2(1/\delta),$$

where $a, b$ are constants in the line-search procedure; typical values of these constants
are $a = 0.1$ and $b = 0.5$.


2.2 Different types of randomized sketches

Various types of randomized sketches are possible, and we describe a few of them here. Given
a sketching matrix $S \in \mathbb{R}^{m \times n}$, we use $\{s_i\}_{i=1}^m$ to denote the collection of its
$n$-dimensional rows. We restrict our attention to sketch matrices that are zero-mean, and that
are normalized so that $\mathbb{E}[S^T S / m] = I_n$.


Sub-Gaussian sketches: The most classical sketch is based on a random matrix $S \in \mathbb{R}^{m \times n}$
with i.i.d. standard Gaussian entries, or somewhat more generally, sketch matrices based on
i.i.d. sub-Gaussian rows. In particular, a zero-mean random vector $s \in \mathbb{R}^n$ is 1-sub-Gaussian
if for any $u \in \mathbb{R}^n$, we have

$$\mathbb{P}\big[\langle s, u \rangle \ge \epsilon \|u\|_2\big] \le e^{-\epsilon^2/2} \quad \text{for all } \epsilon \ge 0. \tag{3}$$

For instance, a vector with i.i.d. $N(0,1)$ entries is 1-sub-Gaussian, as is a vector with i.i.d.
Rademacher entries (uniformly distributed over $\{-1,+1\}$). We use the terminology
sub-Gaussian sketch to mean a random matrix $S \in \mathbb{R}^{m \times n}$ with i.i.d. rows that are zero-mean,
1-sub-Gaussian, and with $\mathrm{cov}(s) = I_n$.

From a theoretical perspective, sub-Gaussian sketches are attractive because of the
well-known concentration properties of sub-Gaussian random matrices. On the other
hand, from a computational perspective, a disadvantage of sub-Gaussian sketches is that they
require matrix-vector multiplications with unstructured random matrices. In particular, given
a data matrix $A \in \mathbb{R}^{n \times d}$, computing its sketched version $SA$ requires $O(mnd)$ basic operations
in general (using classical matrix multiplication).
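
The dense sub-Gaussian sketches described above can be illustrated with a few lines of NumPy. This is only a minimal sketch under stated assumptions: the function names `gaussian_sketch` and `rademacher_sketch`, the problem sizes and the seed are hypothetical choices made for illustration, and the product $SA$ is formed with an ordinary dense multiplication.

```python
import numpy as np

def gaussian_sketch(m, n, rng):
    """m x n matrix with i.i.d. N(0, 1) entries, so that E[S^T S / m] = I_n."""
    return rng.standard_normal((m, n))

def rademacher_sketch(m, n, rng):
    """m x n matrix with i.i.d. entries uniform on {-1, +1}."""
    return rng.choice([-1.0, 1.0], size=(m, n))

rng = np.random.default_rng(0)
n, d, m = 512, 8, 128                 # illustrative sizes only
A = rng.standard_normal((n, d))       # stand-in for a data matrix

S = gaussian_sketch(m, n, rng)
SA = S @ A                            # dense product: O(m n d) operations

# The sketched Gram matrix (SA)^T (SA) / m approximates A^T A.
gram_exact = A.T @ A
gram_sketch = SA.T @ SA / m
rel_err = (np.linalg.norm(gram_sketch - gram_exact, 2)
           / np.linalg.norm(gram_exact, 2))
```

The relative error `rel_err` shrinks as the sketch dimension `m` grows relative to `d`, which is the concentration behaviour the text refers to.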

Sketches based on randomized orthonormal systems (ROS): The second type of
randomized sketch we consider is the randomized orthonormal system (ROS), for which matrix
multiplication can be performed much more efficiently. In order to define a ROS sketch, we
first let $H \in \mathbb{R}^{n \times n}$ be an orthonormal matrix with entries
$H_{ij} \in [-\frac{1}{\sqrt{n}}, \frac{1}{\sqrt{n}}]$. Standard classes
of such matrices are the Hadamard or Fourier bases, for which matrix-vector multiplication
can be performed in $O(n \log n)$ time via the fast Hadamard or Fourier transforms, respectively.
Based on any such matrix, a sketching matrix $S \in \mathbb{R}^{m \times n}$ from a ROS ensemble is obtained
by sampling i.i.d. rows of the form

$$s^T = \sqrt{n}\, e_j^T H D \quad \text{with probability } 1/n \text{ for } j = 1, \ldots, n,$$

where the random vector $e_j \in \mathbb{R}^n$ is chosen uniformly at random from the set of all $n$
canonical basis vectors, and $D = \mathrm{diag}(\nu)$ is a diagonal matrix of i.i.d. Rademacher variables
$\nu \in \{-1,+1\}^n$. Given a fast routine for matrix-vector multiplication, the sketch $SM$ for a
data matrix $M \in \mathbb{R}^{n \times d}$ can be formed in $O(nd \log m)$ time.
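
As a rough illustration of the ROS construction, the following sketch builds rows of the form $\sqrt{n}\, e_j^T H D$ using a Walsh-Hadamard basis. The names `fwht` and `ros_sketch` are hypothetical, the number of rows is assumed to be a power of two, and the code applies the full transform (cost $O(nd \log n)$) rather than the pruned variant needed for the $O(nd \log m)$ figure quoted in the text.

```python
import numpy as np

def fwht(x):
    """Orthonormal fast Walsh-Hadamard transform along axis 0.

    The number of rows n must be a power of two; the cost is
    O(n log n) per column.
    """
    x = np.array(x, dtype=float)          # work on a copy
    n = x.shape[0]
    if n & (n - 1) != 0:
        raise ValueError("number of rows must be a power of two")
    h = 1
    while h < n:
        for i in range(0, n, 2 * h):
            a = x[i:i + h].copy()
            b = x[i + h:i + 2 * h].copy()
            x[i:i + h] = a + b            # butterfly: sum
            x[i + h:i + 2 * h] = a - b    # butterfly: difference
        h *= 2
    return x / np.sqrt(n)

def ros_sketch(M, m, rng):
    """Return S M, where the rows of S are sqrt(n) e_j^T H D, j uniform."""
    n = M.shape[0]
    signs = rng.choice([-1.0, 1.0], size=n)   # diagonal of D
    HDM = fwht(signs[:, None] * M)            # H D M via the fast transform
    rows = rng.integers(0, n, size=m)         # i.i.d. uniform row indices
    return np.sqrt(n) * HDM[rows]
```

For example, `ros_sketch(M, m, np.random.default_rng(0))` with `M` of shape `(n, d)` returns an `(m, d)` array; with this scaling the rows satisfy $\mathbb{E}[s s^T] = I_n$.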

Sketches based on random row sampling: Given a probability distribution $\{p_j\}_{j=1}^n$ over
$[n] = \{1, \ldots, n\}$, another choice of sketch is to randomly sample the rows of a data matrix $M$
a total of $m$ times with replacement from the given probability distribution. Thus, the rows
of $S$ are independent and take on the values

$$s^T = \frac{e_j^T}{\sqrt{p_j}} \quad \text{with probability } p_j \text{ for } j = 1, \ldots, n,$$

where $e_j \in \mathbb{R}^n$ is the $j$-th canonical basis vector. Different choices of the weights $\{p_j\}_{j=1}^n$
are possible, including those based on the row $\ell_2$ norms, $p_j \propto \|e_j^T M\|_2^2$, and the leverage
values of $M$, i.e., $p_j \propto \|e_j^T U\|_2^2$ for $j = 1, \ldots, n$, where $U \in \mathbb{R}^{n \times d}$ is the matrix of
left singular vectors of $M$ [6]. When $M \in \mathbb{R}^{n \times d}$ is the adjacency matrix of a graph with $d$
vertices and $n$ edges, the leverage scores of $M$ are also known as effective resistances, which
can be used to sub-sample edges of a given graph while preserving its spectral properties [22].
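
Row sampling with leverage-score weights can be sketched as follows. The helper names `leverage_scores` and `row_sampling_sketch`, the random test matrix and the sample sizes are assumptions made for illustration; the leverage scores are computed from a thin SVD, which is itself an $O(nd^2)$ operation and is used here only for clarity.

```python
import numpy as np

def leverage_scores(M):
    """Squared row norms of the left singular vectors of M (thin SVD)."""
    U, _, _ = np.linalg.svd(M, full_matrices=False)
    return np.sum(U ** 2, axis=1)

def row_sampling_sketch(M, m, p, rng):
    """Sample m rows of M i.i.d. from p and rescale row j by 1/sqrt(p_j).

    This equals S M for a sketch S with rows s^T = e_j^T / sqrt(p_j).
    """
    idx = rng.choice(M.shape[0], size=m, replace=True, p=p)
    return M[idx] / np.sqrt(p[idx])[:, None]

rng = np.random.default_rng(0)
M = rng.standard_normal((200, 5))     # illustrative data matrix
lev = leverage_scores(M)
p = lev / lev.sum()                   # sampling probabilities p_j
SM = row_sampling_sketch(M, 50, p, rng)
```

With this rescaling, `SM.T @ SM / 50` is an unbiased estimate of `M.T @ M`.
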

2.3 Gaussian widths

In this section, we introduce some background on the notion of Gaussian width, a way of
measuring the size of a compact set in $\mathbb{R}^d$. These width measures play a key role in the
analysis of randomized sketches. Given a compact subset $\mathcal{L} \subseteq \mathbb{R}^d$, its Gaussian width is given
by

$$\mathcal{W}(\mathcal{L}) := \mathbb{E}_g\Big[\max_{z \in \mathcal{L}} |\langle g, z \rangle|\Big] \tag{4}$$

where $g \in \mathbb{R}^d$ is an i.i.d. sequence of $N(0,1)$ variables. This complexity measure plays an
important role in Banach space theory, learning theory and statistics.

Of particular interest in this paper are sets $\mathcal{L}$ that are obtained by intersecting a given
cone $\mathcal{K}$ with the Euclidean sphere $\mathcal{S}^{d-1} = \{z \in \mathbb{R}^d \mid \|z\|_2 = 1\}$. It is easy to show that
the Gaussian width of any such set is at most $\sqrt{d}$, but it can be substantially smaller,
depending on the nature of the underlying cone. For instance, if $\mathcal{K}$ is a subspace of dimension
$r < d$, then a simple calculation yields that $\mathcal{W}(\mathcal{K} \cap \mathcal{S}^{d-1}) \le \sqrt{r}$.
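
The subspace bound can be checked numerically. For a subspace $\mathcal{K}$ with orthonormal basis matrix $Q$, the maximum of $|\langle g, z \rangle|$ over unit vectors $z \in \mathcal{K}$ equals $\|Q^T g\|_2$, so the width is an expectation that can be estimated by Monte Carlo. The function name `subspace_width`, the dimensions and the number of samples below are illustrative assumptions.

```python
import numpy as np

def subspace_width(Q, num_samples, rng):
    """Monte Carlo estimate of the Gaussian width of range(Q) ∩ sphere.

    Q must have orthonormal columns. For a subspace, the maximum over
    unit z in range(Q) of |<g, z>| equals ||Q^T g||_2.
    """
    G = rng.standard_normal((num_samples, Q.shape[0]))
    return np.linalg.norm(G @ Q, axis=1).mean()

rng = np.random.default_rng(0)
d, r = 50, 5
# Orthonormal basis of a random r-dimensional subspace of R^d.
Q, _ = np.linalg.qr(rng.standard_normal((d, r)))

w_sub = subspace_width(Q, 20000, rng)            # close to, and below, sqrt(r)
w_full = subspace_width(np.eye(d), 20000, rng)   # close to, and below, sqrt(d)
```

The estimate `w_sub` depends only on the subspace dimension `r`, not on the ambient dimension `d`, which is why constrained problems can admit much smaller sketches.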


3 Newton sketch and local convergence 

With the basic background in place, let us now introduce the Newton sketch algorithm,
and then develop a number of convergence guarantees associated with it. It applies to an
optimization problem of the form $\min_{x \in \mathcal{C}} f(x)$, where $f : \mathbb{R}^d \to \mathbb{R}$ is a twice-differentiable
convex function, and $\mathcal{C} \subseteq \mathbb{R}^d$ is a convex constraint set.


3.1 Newton sketch algorithm 


In order to motivate the Newton sketch algorithm, recall the standard form of Newton's
algorithm: given a current iterate $x^t \in \mathcal{C}$, it generates the new iterate $x^{t+1}$ by performing a
constrained minimization of the second-order Taylor expansion, viz.

$$x^{t+1} = \arg\min_{x \in \mathcal{C}} \Big\{ \tfrac{1}{2} \langle x - x^t, \nabla^2 f(x^t)(x - x^t) \rangle + \langle \nabla f(x^t), x - x^t \rangle \Big\}. \tag{5a}$$

In the unconstrained case, that is, when $\mathcal{C} = \mathbb{R}^d$, it takes the simpler form

$$x^{t+1} = x^t - [\nabla^2 f(x^t)]^{-1} \nabla f(x^t). \tag{5b}$$

Now suppose that we have available a Hessian matrix square root $\nabla^2 f(x)^{1/2}$, that is, a
matrix $\nabla^2 f(x)^{1/2}$ of dimensions $n \times d$ such that

$$(\nabla^2 f(x)^{1/2})^T \nabla^2 f(x)^{1/2} = \nabla^2 f(x) \quad \text{for some integer } n \ge \mathrm{rank}(\nabla^2 f(x)).$$

In many cases, such a matrix square root can be computed efficiently. For instance, consider
a function of the form $f(x) = g(Ax)$ where $A \in \mathbb{R}^{n \times d}$ and the function $g : \mathbb{R}^n \to \mathbb{R}$ has the
separable form $g(Ax) = \sum_{i=1}^n g_i(\langle a_i, x \rangle)$. In this case, a suitable Hessian matrix square root
is given by the $n \times d$ matrix $\nabla^2 f(x)^{1/2} := \mathrm{diag}\{g_i''(\langle a_i, x \rangle)^{1/2}\}_{i=1}^n A$. In Section 3.2, we discuss
various concrete instantiations of such functions.

In terms of this notation, the ordinary Newton update can be re-written as

$$x^{t+1} = \arg\min_{x \in \mathcal{C}} \Big\{ \underbrace{\tfrac{1}{2} \|\nabla^2 f(x^t)^{1/2} (x - x^t)\|_2^2 + \langle \nabla f(x^t), x - x^t \rangle}_{\Phi(x)} \Big\},$$
and the Newton Sketch algorithm is most easily understood based on this form of the
updates. More precisely, for a sketch dimension $m$ to be chosen, let $S \in \mathbb{R}^{m \times n}$ be an isotropic
sketch matrix, satisfying the relation $\mathbb{E}[S^T S] = I_n$. The Newton Sketch algorithm generates
a sequence of iterates $\{x^t\}_{t=0}^{\infty}$ according to the recursion

$$x^{t+1} := \arg\min_{x \in \mathcal{C}} \Big\{ \underbrace{\tfrac{1}{2} \|S^t \nabla^2 f(x^t)^{1/2} (x - x^t)\|_2^2 + \langle \nabla f(x^t), x - x^t \rangle}_{\Phi(x; S^t)} \Big\}, \tag{6}$$

where $S^t \in \mathbb{R}^{m \times n}$ is an independent realization of a sketching matrix. When the problem
is unconstrained, i.e., $\mathcal{C} = \mathbb{R}^d$, and the matrix $(S^t \nabla^2 f(x^t)^{1/2})^T S^t \nabla^2 f(x^t)^{1/2}$ is invertible, the
Newton sketch update takes the simpler form

$$x^{t+1} = x^t - \big( (S^t \nabla^2 f(x^t)^{1/2})^T S^t \nabla^2 f(x^t)^{1/2} \big)^{-1} \nabla f(x^t). \tag{7}$$

The intuition underlying the Newton sketch updates is as follows: the iterate $x^{t+1}$ corresponds
to the constrained minimizer of the random objective function $\Phi(x; S^t)$ whose expectation
$\mathbb{E}[\Phi(x; S^t)]$, taking averages over the isotropic sketch matrix $S^t$, is equal to the original Newton
objective $\Phi(x)$. Consequently, it can be seen as a stochastic form of the Newton update.
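
To make the unconstrained sketched update concrete, here is a minimal sketch for logistic regression, $f(x) = \sum_i \log(1 + \exp(-y_i \langle a_i, x \rangle))$, whose Hessian square root is $\mathrm{diag}(\psi''(\langle a_i, x \rangle)^{1/2}) A$. The function names, the synthetic data and the choice of a Gaussian sketch scaled by $1/\sqrt{m}$ (so that $\mathbb{E}[S^T S] = I_n$) are assumptions for illustration rather than the authors' implementation.

```python
import numpy as np

def logistic_grad_and_sqrt(A, y, x):
    """Gradient and n x d Hessian square root of
    f(x) = sum_i log(1 + exp(-y_i <a_i, x>)), with labels y_i in {-1, +1}."""
    u = y * (A @ x)
    s = 1.0 / (1.0 + np.exp(u))          # sigmoid(-u)
    grad = -A.T @ (y * s)
    w = s * (1.0 - s)                    # second derivatives, all >= 0
    return grad, np.sqrt(w)[:, None] * A

def newton_sketch_step(A, y, x, S):
    """One unconstrained update x - ((S B)^T (S B))^{-1} grad,
    where B is the Hessian square root at x."""
    grad, B = logistic_grad_and_sqrt(A, y, x)
    SB = S @ B
    return x - np.linalg.solve(SB.T @ SB, grad)

rng = np.random.default_rng(0)
n, d, m = 400, 5, 60                     # illustrative sizes only
A = rng.standard_normal((n, d))
y = np.sign(A @ rng.standard_normal(d) + rng.standard_normal(n))
x0 = np.zeros(d)

S = rng.standard_normal((m, n)) / np.sqrt(m)          # E[S^T S] = I_n
x_sketch = newton_sketch_step(A, y, x0, S)            # sketched step
x_exact = newton_sketch_step(A, y, x0, np.eye(n))     # S = I: exact Newton step
```

Passing the identity as the sketch recovers the ordinary Newton step, which gives a convenient sanity check for an implementation.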

In this paper, we also analyze a partially sketched Newton update, which takes the following
form. Given an additive decomposition of the form $f = f_0 + g$, we perform a sketch of the
Hessian $\nabla^2 f_0$ while retaining the exact form of the Hessian $\nabla^2 g$. This leads to the partially
sketched update

$$x^{t+1} := \arg\min_{x \in \mathcal{C}} \Big\{ \tfrac{1}{2} (x - x^t)^T Q^t (x - x^t) + \langle \nabla f(x^t), x - x^t \rangle \Big\}, \tag{8}$$

where $Q^t := (S^t \nabla^2 f_0(x^t)^{1/2})^T S^t \nabla^2 f_0(x^t)^{1/2} + \nabla^2 g(x^t)$.

For either the fully sketched update (6) or the partially sketched update (8), our analysis shows that
there are many settings in which the sketch dimension $m$ can be chosen to be substantially
smaller than $n$, in which cases the sketched Newton updates will be much cheaper than a
standard Newton update. For instance, the unconstrained update (7) can be computed in
at most $O(md^2)$ time, as opposed to the $O(nd^2)$ time of the standard Newton update. In
constrained settings, we show that the sketch dimension $m$ can often be chosen even smaller,
even $m \ll d$, which leads to further savings.
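
The partially sketched matrix $Q^t$ can be formed as in the following sketch, here for a least-squares term $f_0(x) = \frac{1}{2}\|Ax - b\|_2^2$ (whose Hessian square root is $A$) plus a ridge term $g(x) = \frac{\rho}{2}\|x\|_2^2$. The function name `partial_sketch_hessian`, the choice of $f_0$ and $g$, and the Gaussian sketch are assumptions for illustration.

```python
import numpy as np

def partial_sketch_hessian(B0, hess_g, S):
    """Q = (S B0)^T (S B0) + hess_g, where B0 is an n x d square root
    of the Hessian of f0 and hess_g is the exact d x d Hessian of g."""
    SB = S @ B0
    return SB.T @ SB + hess_g

rng = np.random.default_rng(1)
n, d, m, rho = 300, 6, 40, 0.5           # illustrative sizes only
A = rng.standard_normal((n, d))
b = rng.standard_normal(n)
x = np.zeros(d)

grad = A.T @ (A @ x - b) + rho * x       # exact gradient of f0 + g at x
S = rng.standard_normal((m, n)) / np.sqrt(m)
Q = partial_sketch_hessian(A, rho * np.eye(d), S)
x_next = x - np.linalg.solve(Q, grad)    # unconstrained partially sketched update
```

Because the exact term $\nabla^2 g$ is added after sketching, $Q$ stays positive definite even when the sketched part is rank deficient.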

3.2 Some examples 

In order to provide some intuition, let us provide some simple examples to which the sketched 
Newton updates can be applied. 

Example 1 (Newton sketch for LP solving). Consider a linear program (LP) in the standard
form

$$\min_{Ax \le b} \langle c, x \rangle \tag{9}$$

where $A \in \mathbb{R}^{n \times d}$ is a given constraint matrix. We assume that the polytope $\{x \in \mathbb{R}^d \mid Ax \le b\}$
is bounded so that the minimum is achieved. A barrier method approach to this LP is based on
solving a sequence of problems of the form

$$\min_{x \in \mathbb{R}^d} \Big\{ \underbrace{\tau \langle c, x \rangle - \sum_{i=1}^n \log(b_i - \langle a_i, x \rangle)}_{f(x)} \Big\},$$

where $a_i \in \mathbb{R}^d$ denotes the $i$-th row of $A$, and $\tau > 0$ is a weight parameter that is adjusted
during the algorithm. By inspection, the function $f : \mathbb{R}^d \to \mathbb{R} \cup \{+\infty\}$ is twice-differentiable,
and its Hessian is given by $\nabla^2 f(x) = A^T \mathrm{diag}\big\{ \frac{1}{(b_i - \langle a_i, x \rangle)^2} \big\} A$. A Hessian square root is given
by $\nabla^2 f(x)^{1/2} := \mathrm{diag}\big( \frac{1}{|b_i - \langle a_i, x \rangle|} \big) A$, which allows us to compute a sketched version of the
Hessian square root

$$S \nabla^2 f(x)^{1/2} = S\, \mathrm{diag}\Big( \frac{1}{|b_i - \langle a_i, x \rangle|} \Big) A.$$

With a ROS sketch matrix, computing this matrix requires $O(nd \log m)$ basic operations.
The complexity of each Newton sketch iteration scales as $O(md^2)$, where $m$ is at most $d$. In
contrast, the standard unsketched form of the Newton update has complexity $O(nd^2)$, so that
the sketched method is computationally cheaper whenever there are more constraints than
dimensions ($n > d$).

By increasing the barrier parameter $\tau$, we obtain a sequence of solutions that approach
the optimum to the LP, which we refer to as the central path. As a simple illustration,
Figure 1 compares the central paths generated by the ordinary and sketched Newton updates
for a polytope defined by $n = 32$ constraints in dimension $d = 2$. Each row shows three
independent trials of the method for a given sketch dimension $m$; the top, middle and bottom
rows correspond to sketch dimensions $m \in \{d, 4d, 16d\}$ respectively. Note that as the sketch
dimension $m$ is increased, the central path taken by the sketched updates converges to the
standard central path.
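
The log-barrier Hessian square root from Example 1 can be verified numerically. The sketch below builds $\mathrm{diag}(1/|b_i - \langle a_i, x \rangle|) A$ for a random polytope with $n = 32$ and $d = 2$, and forms one sketched Newton direction. The function names, the random constraint data, and the use of a Gaussian sketch in place of the ROS sketch mentioned in the text are assumptions for illustration.

```python
import numpy as np

def barrier_grad(A, b, c, tau, x):
    """Gradient of f(x) = tau*<c, x> - sum_i log(b_i - <a_i, x>)."""
    return tau * c + A.T @ (1.0 / (b - A @ x))

def barrier_hessian_sqrt(A, b, x):
    """n x d Hessian square root diag(1 / |b_i - <a_i, x>|) A.

    It does not depend on tau or c, since the linear term has zero Hessian.
    """
    slack = b - A @ x
    if np.any(slack <= 0):
        raise ValueError("x must be strictly feasible: Ax < b")
    return A / slack[:, None]

rng = np.random.default_rng(0)
n, d, m = 32, 2, 8                       # sizes as in the example; m = 4d
A = rng.standard_normal((n, d))          # hypothetical constraint matrix
b = np.ones(n)                           # x = 0 is strictly feasible
c = np.array([1.0, -0.5])
x, tau = np.zeros(d), 2.0

B = barrier_hessian_sqrt(A, b, x)
H_exact = B.T @ B                        # equals A^T diag(1/slack^2) A
S = rng.standard_normal((m, n)) / np.sqrt(m)
H_sketch = (S @ B).T @ (S @ B)           # sketched d x d Hessian
step = -np.linalg.solve(H_sketch, barrier_grad(A, b, c, tau, x))
```

Forming `H_sketch` from the $m \times d$ matrix `S @ B` costs $O(md^2)$, compared with $O(nd^2)$ for `H_exact`.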


As a second example, we consider the problem of maximum likelihood estimation for 
generalized linear models. 

Example 2 (Newton sketch for maximum likelihood estimation). The class of generalized
linear models (GLMs) is used to model a wide variety of prediction and classification problems,
in which the goal is to predict some output variable $y \in \mathcal{Y}$ on the basis of a covariate vector
$a \in \mathbb{R}^d$. It includes as special cases the standard linear Gaussian model (in which $\mathcal{Y} = \mathbb{R}$), as
well as logistic models for classification (in which $\mathcal{Y} = \{-1,+1\}$), as well as Poisson models
for count-valued responses (in which $\mathcal{Y} = \{0, 1, 2, \ldots\}$).

Given a collection of $n$ observations $\{(y_i, a_i)\}_{i=1}^n$ of response-covariate pairs from some
GLM, the problem of constrained maximum likelihood estimation can be written in the form

$$\min_{x \in \mathcal{C}} \Big\{ \underbrace{\sum_{i=1}^n \psi(\langle a_i, x \rangle, y_i)}_{f(x)} \Big\}, \tag{10}$$

where $\psi : \mathbb{R} \times \mathcal{Y} \to \mathbb{R}$ is a given convex function, and $\mathcal{C} \subseteq \mathbb{R}^d$ is a convex constraint set, chosen
by the user to enforce a certain type of structure in the solution. Important special cases of
GLMs include the linear Gaussian model, in which $\psi(u, y) = \frac{1}{2}(y - u)^2$, and the problem (10)

[Figure 1 panels: (a) sketch size m = d; (b) sketch size m = 4d; (c) sketch size m = 16d. Each row shows Trial 1, Trial 2 and Trial 3, with legend entries "Exact Newton" and "Newton Sketch".]

Figure 1. Comparisons of central paths for a simple linear program in two dimensions.
Each row shows three independent trials for a given sketch dimension: across the rows, the
sketch dimension ranges as $m \in \{d, 4d, 16d\}$. The black arrows show Newton steps taken by
the standard interior point method, whereas red arrows show the steps taken by the sketched
version. The green point at the vertex represents the optimum. In all cases, the sketched
algorithm converges to the optimum, and as the sketch dimension $m$ increases, the sketched
central path converges to the standard central path.


corresponds to a regularized form of least-squares, as well as the problem of logistic regression,
obtained by setting $\psi(u, y) = \log(1 + \exp(-yu))$.

Letting $A \in \mathbb{R}^{n \times d}$ denote the data matrix with $a_i \in \mathbb{R}^d$ as its $i$-th row, the Hessian of the
objective (10) takes the form

$$\nabla^2 f(x) = A^T \mathrm{diag}\big( \psi''(a_i^T x) \big)_{i=1}^n A.$$

Since the function $\psi$ is convex, we are guaranteed that $\psi''(a_i^T x) \ge 0$, and hence the quantity
$\mathrm{diag}\big( \psi''(a_i^T x) \big)^{1/2} A$ can be used as an $n \times d$ matrix square-root. We return to explore this
class of examples in more depth in Section 5.1.


3.3 Local convergence analysis using strong convexity 

Returning now to the general setting, we begin by proving a local convergence guarantee
for the sketched Newton updates. In particular, this theorem provides insight into how large
the sketch dimension $m$ must be in order to guarantee good local behavior of the sketched
Newton algorithm.

This choice of sketch dimension is determined by the geometry of the problem, in particular in
terms of the tangent cone defined by the optimum. Given a constraint set $\mathcal{C}$ and the minimizer
$x^* := \arg\min_{x \in \mathcal{C}} f(x)$, the tangent cone at $x^*$ is given by

$$\mathcal{K} := \{\Delta \in \mathbb{R}^d \mid x^* + t\Delta \in \mathcal{C} \text{ for some } t > 0\}. \tag{11}$$

Recalling the definition of the Gaussian width from Section 2.3, our first main result requires
the sketch dimension to satisfy a lower bound of the form

$$m \ge \frac{c}{\epsilon^2} \max_{x \in \mathcal{C}} \mathcal{W}^2(\nabla^2 f(x)^{1/2} \mathcal{K}), \tag{12}$$

where $\epsilon \in (0,1)$ is a user-defined tolerance, and $c$ is a universal constant. Since the Hessian
square-root $\nabla^2 f(x)^{1/2}$ has dimensions $n \times d$, this squared Gaussian width is at most
$\min\{n, d\}$. This worst-case bound is achieved for an unconstrained problem (in which case
$\mathcal{K} = \mathbb{R}^d$), but the Gaussian width can be substantially smaller for constrained problems. See
the example following Theorem 1 for an illustration.

In addition to this Gaussian width, our analysis depends on the cone-constrained
eigenvalues of the Hessian $\nabla^2 f(x^*)$, which are defined as

$$\gamma = \inf_{z \in \mathcal{K} \cap \mathcal{S}^{d-1}} \langle z, \nabla^2 f(x^*) z \rangle, \quad \text{and} \quad \beta = \sup_{z \in \mathcal{K} \cap \mathcal{S}^{d-1}} \langle z, \nabla^2 f(x^*) z \rangle. \tag{13}$$

In the unconstrained case ($\mathcal{C} = \mathbb{R}^d$), we have $\mathcal{K} = \mathbb{R}^d$, so that $\gamma$ and $\beta$ reduce to the
minimum and maximum eigenvalues of the Hessian $\nabla^2 f(x^*)$. In the classical analysis of
Newton's method, these quantities measure the strong convexity and smoothness parameters
of the function $f$.

With this set-up, the following theorem is applicable to any twice-differentiable objective
$f$ with cone-constrained eigenvalues $(\gamma, \beta)$ defined in equation (13), and with Hessian that is
$L$-Lipschitz continuous, as defined in equation (2).

Theorem 1 (Local convergence of Newton Sketch). For given parameters $\delta, \epsilon \in (0,1)$,
consider the Newton sketch updates based on an initialization $x^0$ such that $\|x^0 - x^*\|_2 \le \delta \frac{\gamma}{8L}$,
and a sketch dimension $m$ satisfying the lower bound (12). Then with probability at least
$1 - c_1 N e^{-c_2 m}$, the $\ell_2$-error satisfies the recursion

$$\|x^{t+1} - x^*\|_2 \le \epsilon \frac{\beta}{\gamma} \|x^t - x^*\|_2 + \frac{4L}{\gamma} \|x^t - x^*\|_2^2. \tag{14}$$


The bound (14) shows that when $\epsilon$ is set to a fixed constant, say $\epsilon = 1/4$, the
algorithm displays a linear-quadratic convergence rate in terms of the error $\Delta^t = x^t - x^*$. More
specifically, the rate is initially quadratic, that is, $\|\Delta^{t+1}\|_2 \approx \frac{4L}{\gamma} \|\Delta^t\|_2^2$ when $\|\Delta^t\|_2$ is large.
However, as the iterations progress and $\|\Delta^t\|_2$ becomes substantially less than 1, the
rate becomes linear, meaning that $\|\Delta^{t+1}\|_2 \approx \epsilon \frac{\beta}{\gamma} \|\Delta^t\|_2$, since the term $\frac{4L}{\gamma} \|\Delta^t\|_2^2$ becomes
negligible compared to $\epsilon \frac{\beta}{\gamma} \|\Delta^t\|_2$. If we perform $N$ steps in total, the linear rate guarantees
the conservative error bounds

$$\|x^N - x^*\|_2 \le \frac{\gamma}{8L} \Big( \frac{1}{2} + \epsilon \frac{\beta}{\gamma} \Big)^N, \quad \text{and} \quad f(x^N) - f(x^*) \le \frac{\beta \gamma}{8L} \Big( \frac{1}{2} + \epsilon \frac{\beta}{\gamma} \Big)^N. \tag{15}$$
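
The transition from the quadratic to the linear regime can be seen by iterating the worst case of a linear-quadratic recursion. The sketch below assumes the recursion has the form $e_{t+1} = \epsilon \frac{\beta}{\gamma} e_t + \frac{4L}{\gamma} e_t^2$ (taken with equality) and starts from $e_0 = \gamma/(8L)$; the numerical constants are hypothetical and chosen only so that the contraction factor $\frac{1}{2} + \epsilon \frac{\beta}{\gamma}$ is below one.

```python
def worst_case_errors(e0, eps, beta, gamma, L, num_steps):
    """Iterate e_{t+1} = eps*(beta/gamma)*e_t + (4L/gamma)*e_t**2,
    i.e. the recursion taken with equality, and return [e_0, ..., e_N]."""
    errs = [e0]
    for _ in range(num_steps):
        e = errs[-1]
        errs.append(eps * (beta / gamma) * e + (4.0 * L / gamma) * e * e)
    return errs

# Hypothetical constants, chosen only for illustration.
gamma, beta, L, eps = 1.0, 1.5, 1.0, 0.25
e0 = gamma / (8.0 * L)

errs = worst_case_errors(e0, eps, beta, gamma, L, 20)
rate = 0.5 + eps * beta / gamma                    # contraction factor
bounds = [e0 * rate ** t for t in range(len(errs))]
ratios = [errs[t + 1] / errs[t] for t in range(len(errs) - 1)]
```

Since $e_t \le \gamma/(8L)$ implies $\frac{4L}{\gamma} e_t \le \frac{1}{2}$, every iterate stays below the geometric bound `bounds[t]`, while the per-step ratio in `ratios` decreases towards the linear rate $\epsilon \beta / \gamma$.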


A notable feature of Theorem 1 is that, depending on the structure of the problem, the
linear-quadratic convergence can be obtained using a sketch dimension $m$ that is substantially
smaller than $\min\{n, d\}$. As an illustrative example, we performed simulations for some
instantiations of a portfolio optimization problem: it is a linearly-constrained quadratic program of
the form

$$\min_{x \ge 0,\ \sum_{j=1}^d x_j = 1} \Big\{ \tfrac{1}{2} x^T A^T A x - \langle c, x \rangle \Big\}, \tag{16}$$

where $A \in \mathbb{R}^{n \times d}$ and $c \in \mathbb{R}^d$ are empirically estimated matrices and vectors (see
Section 5.3 for more details). We used the Newton sketch to solve different sizes of this problem
$d \in \{10, 20, 30, 40, 50, 60\}$, and with $n = d^3$ in each case. Each problem was constructed so
that the optimum $x^*$ had at most $s = \lceil 2 \log(d) \rceil$ non-zero entries. A calculation of the
Gaussian width for this problem (see Appendix C for the details) shows that it suffices to take a
sketch dimension $m \gtrsim s \log d$, and we implemented the algorithm with this choice. Figure 2


[Figure 2 plot title: Convergence rate of Newton Sketch]

Figure 2. Empirical illustration of the linear convergence of the Newton sketch algorithm for
an ensemble of portfolio optimization problems (16). In all cases, the algorithm was
implemented using a sketch dimension $m = \lceil 4 s \log d \rceil$, where $s$ is an upper bound on the number of
non-zeros in the optimal solution $x^*$; this quantity satisfies the required lower bound (12), and
consistent with the theory, the algorithm displays linear convergence.


shows the convergence rate of the Newton sketch algorithm for the six different problem sizes:
consistent with our theory, the sketch dimension $m \ll \min\{d, n\}$ suffices to guarantee linear
convergence in all cases.

It is also possible to obtain an asymptotically super-linear rate by using an iteration-dependent
sketching accuracy $\epsilon = \epsilon(t)$. The following corollary summarizes one such possible guarantee:


Corollary 1. Consider the Newton sketch iterates using the iteration-dependent sketching
accuracy $\epsilon(t) = \frac{1}{\log(1+t)}$. Then with the same probability as in Theorem 1, we have

$$\|x^{t+1} - x^*\|_2 \le \frac{1}{\log(1+t)} \frac{\beta}{\gamma} \|x^t - x^*\|_2 + \frac{4L}{\gamma} \|x^t - x^*\|_2^2,$$

and consequently, super-linear convergence is obtained, namely,
$\lim_{t \to \infty} \frac{\|x^{t+1} - x^*\|_2}{\|x^t - x^*\|_2} = 0$.

Note that the price for this super-linear convergence is that the sketch size is inflated by the
factor $\epsilon^{-2}(t) = \log^2(1+t)$, so it is only logarithmic in the iteration number.


4 Newton sketch for self-concordant functions 

The analysis and complexity estimates given in the previous section involve the curvature
constants $(\gamma, \beta)$ and the Lipschitz constant $L$, which are seldom known in practice. Moreover,
as with the analysis of the classical Newton method, the theory is local, in that the linear-quadratic
convergence takes place once the iterates enter a suitable basin of the origin.

In this section, we seek to obtain global convergence results that do not depend on unknown 
problem parameters. As in the classical analysis, the appropriate setting in which to seek 
such results is for self-concordant functions, and using an appropriate form of backtracking 
line search. We begin by analyzing the unconstrained case, and then discuss extensions to 
constrained problems with self-concordant barriers. In each case, we show that given a suitable 
lower bound on the sketch dimension, the sketched Newton updates can be equipped with 
global convergence guarantees that hold with exponentially high probability. Moreover, the 
total number of iterations does not depend on any unknown constants such as strong convexity 
and Lipschitz parameters. 

4.1 Unconstrained case 

In this section, we consider the unconstrained optimization problem $\min_{x \in \mathbb{R}^d} f(x)$, where $f$ is
a closed convex self-concordant function which is bounded below. Note that a closed convex
function $\phi : \mathbb{R} \to \mathbb{R}$ is self-concordant if

$$|\phi'''(x)| \le 2 (\phi''(x))^{3/2}. \tag{17}$$

This definition can be extended to a function $f : \mathbb{R}^d \to \mathbb{R}$ by imposing this requirement on the
univariate functions $\phi_{x,y}(t) := f(x + ty)$, for all choices of $x, y$ in the domain of $f$. Examples
of self-concordant functions include linear and quadratic functions and the negative logarithm.
Self-concordance is preserved under addition and affine transformations.
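
The self-concordance inequality $|\phi'''(x)| \le 2(\phi''(x))^{3/2}$ is easy to test pointwise once the second and third derivatives are known in closed form. The helper name `self_concordance_ratio` and the evaluation points below are illustrative assumptions; the exponential function is included only as a standard example where the inequality fails.

```python
import math

def self_concordance_ratio(d2, d3, x):
    """Return |phi'''(x)| / (2 * phi''(x)**1.5).

    A value <= 1 means the self-concordance inequality holds at x.
    d2 and d3 are callables for the second and third derivatives.
    """
    return abs(d3(x)) / (2.0 * d2(x) ** 1.5)

# phi(x) = -log(x) on x > 0: phi'' = 1/x^2 and phi''' = -2/x^3,
# so the inequality holds with equality everywhere.
neg_log_ratio = [
    self_concordance_ratio(lambda x: 1.0 / x ** 2, lambda x: -2.0 / x ** 3, x)
    for x in (0.1, 1.0, 7.5)
]

# phi(x) = exp(x): phi'' = phi''' = exp(x), so the ratio is exp(-x/2)/2,
# which exceeds 1 for sufficiently negative x.
exp_ratio = self_concordance_ratio(math.exp, math.exp, -4.0)
```

The negative logarithm is the extreme case, attaining the ratio 1 at every point, which is why the constant 2 in the definition is natural for log-barrier functions.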

Our main result provides a bound on the total number of Newton sketch iterations required
to obtain a $\delta$-accurate solution without imposing any sort of initialization condition (as was
done in our previous analysis). This bound scales proportionally to $\log(1/\delta)$ and inversely in
a parameter $\nu$ that depends on the sketching accuracy $\epsilon$ and the backtracking parameters
$(a, b)$ via

V = ab 




where 


1 

8 ( 1^)3 


(18) 

Algorithm 1 Unconstrained Newton Sketch with backtracking line search

Input: Starting point $x^0$, tolerance $\delta > 0$, $(a, b)$ line-search parameters, sketching matrices $\{S^t\}_{t=0}^{\infty} \in \mathbb{R}^{m \times n}$.

1: Compute approximate Newton step $\Delta x^t$ and approximate Newton decrement $\lambda_f(x^t)$:

   $\Delta x^t := \arg\min_{\Delta} \langle \nabla f(x^t), \Delta \rangle + \frac{1}{2} \|S^t (\nabla^2 f(x^t))^{1/2} \Delta\|_2^2$;
   $\lambda_f(x^t) := \nabla f(x^t)^T \Delta x^t$.

2: Quit if $\lambda_f(x^t)^2 / 2 \le \delta$.

3: Line search: choose $\mu$: while $f(x^t + \mu \Delta x^t) > f(x^t) + a \mu \lambda_f(x^t)$, $\mu \leftarrow b \mu$.
4: Update: $x^{t+1} = x^t + \mu \Delta x^t$.

Output: minimizer $x^t$, optimality gap $\lambda_f(x^t)$.
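
The steps of Algorithm 1 can be sketched in NumPy as follows. Several choices here are assumptions rather than part of the source: the function name `newton_sketch`, the use of fresh Gaussian sketches scaled by $1/\sqrt{m}$, the initial step size $\mu = 1$, and the stopping rule, which is read as the usual Newton-decrement test $-\nabla f(x^t)^T \Delta x^t / 2 \le \delta$. The test problem is a log-barrier objective $f(x) = \langle c, x \rangle - \sum_i \log(1 - \langle a_i, x \rangle)$ with hypothetical random data.

```python
import numpy as np

def newton_sketch(f, grad, hess_sqrt, x0, m, delta=1e-10, a=0.1, b=0.5,
                  max_iter=100, seed=0):
    """Sketched Newton iterations with backtracking line search."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad(x)
        B = hess_sqrt(x)                          # n x d Hessian square root
        S = rng.standard_normal((m, B.shape[0])) / np.sqrt(m)  # fresh sketch
        SB = S @ B
        dx = -np.linalg.solve(SB.T @ SB, g)       # approximate Newton step
        lam = g @ dx                              # decrement term, always <= 0
        if -lam / 2.0 <= delta:                   # stopping rule (assumed reading)
            break
        mu = 1.0
        while f(x + mu * dx) > f(x) + a * mu * lam:   # backtracking line search
            mu *= b
        x = x + mu * dx
    return x

# Hypothetical test problem: a log-barrier objective over {x : A x < 1}.
rng = np.random.default_rng(1)
n, d = 200, 4
A = rng.standard_normal((n, d))
rhs = np.ones(n)
c = rng.standard_normal(d)

def f(x):
    slack = rhs - A @ x
    # Return +inf outside the domain so the line search backtracks.
    return np.inf if np.any(slack <= 0) else c @ x - np.sum(np.log(slack))

def grad(x):
    return c + A.T @ (1.0 / (rhs - A @ x))

def hess_sqrt(x):
    return A / (rhs - A @ x)[:, None]

x_hat = newton_sketch(f, grad, hess_sqrt, np.zeros(d), m=40)
```

Returning `np.inf` outside the domain lets the same backtracking loop keep the iterates strictly feasible without a separate feasibility check.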


Theorem 2. Let $f$ be a strictly convex self-concordant function. Given a sketching matrix
$S \in \mathbb{R}^{m \times n}$ with $m \ge \frac{c}{\epsilon^2} \max_{x \in \mathcal{C}} \mathrm{rank}(\nabla^2 f(x)) = \frac{c}{\epsilon^2} d$, the number of total iterations $T$ for
obtaining a $\delta$-approximate solution in function value via Algorithm 1 is bounded by

$$N = \frac{f(x^0) - f(x^*)}{\nu} + 0.65 \log_2\Big( \frac{1}{16\delta} \Big),$$

with probability at least $1 - c_1 N e^{-c_2 m}$.

The bound in the above theorem shows that the convergence of the Newton Sketch is
independent of the properties of the function $f$ and problem parameters, similar to the classical
Newton's method. Note that for problems with $n > d$, the complexity of each Newton sketch
step is at most $O(d^3 + nd \log d)$, which is smaller than that of Newton's method ($O(nd^2)$),
and comparable, up to a logarithmic factor, to the $O(nd)$ per-iteration cost of typical
first-order optimization methods whenever $n > d^2$.


4.2 Newton Sketch with self-concordant barriers 

We now turn to the more general constrained case. Given a closed, convex self-concordant 
function /q : M'’* —>■ M, let C be a convex subset of M'’*, and consider the constrained optimization 
problem foi^). If we are given a convex self-concordant barrier function g for the 

constraint set C, it is equivalent to consider the unconstrained problem 

min I /o(x) + g{x) 
xeiR'* t'- V -^ ^ 

/G) 

One way in which to solve this unconstrained problem is by sketching the Hessian of both
$f_0$ and $g$, in which case the theory of the previous section is applicable. However, there are
many cases in which the constraints describing $C$ are relatively simple, and so the Hessian of
$g$ is highly structured. For instance, if the constraint set is the usual simplex (i.e., $x \ge 0$ and
$\langle \mathbf{1}, x \rangle \le 1$), then the Hessian of the associated log barrier function is a diagonal matrix plus
a rank one matrix. Other examples include problems for which $g$ has a separable structure;
such functions frequently arise as regularizers for ill-posed inverse problems. Examples of such
regularizers include $\ell_2$ regularization $g(x) = \frac{1}{2}\|x\|_2^2$, graph regularization
$g(x) = \frac{1}{2}\sum_{(i,j) \in E} (x_i - x_j)^2$ induced by an edge set $E$ (e.g., finite differences), and also other
differentiable norms $g(x) = \big(\sum_{i=1}^d x_i^p\big)^{1/p}$ for $1 < p < \infty$.

In all such cases, an attractive strategy is to apply a partial Newton sketch, in which we
sketch the Hessian term $\nabla^2 f_0(x)$ and retain the exact Hessian $\nabla^2 g(x)$, as in the
updates described previously. More formally, Algorithm 2 provides a summary of the steps, including
the choice of the line search parameters. The main result of this section provides a guarantee
on this algorithm, assuming that the sequence of sketch dimensions $\{m^t\}_{t=0}^{\infty}$ is appropriately
chosen.
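
The partial sketch can be illustrated on the simplest structured case mentioned above, an $\ell_2$ regularizer. The Python sketch below is only an illustration under assumptions of our choosing: the objective is taken to be $f_0(x) = \frac{1}{2}\|Ax - b\|_2^2$ with $g(x) = \frac{\rho}{2}\|x\|_2^2$, the sketching matrix `S` is supplied by the caller, and the name `partial_sketch_direction` is hypothetical.

```python
import numpy as np

def partial_sketch_direction(A, b, x, rho, S):
    """Direction for f0(x) = 0.5*||Ax - b||^2 plus g(x) = 0.5*rho*||x||^2.

    Only the Hessian of f0 (whose square root is A) is sketched by S;
    the Hessian of g, namely rho * I, is kept exact.
    """
    grad = A.T @ (A @ x - b) + rho * x          # gradient of f0 + g
    SA = S @ A                                  # sketched Hessian square root
    H = SA.T @ SA + rho * np.eye(A.shape[1])    # sketched part + exact part
    return -np.linalg.solve(H, grad)
```

When `S` is the identity the step reduces to the exact Newton step, which for this quadratic objective lands on the minimizer in one move; with a random `S` having fewer rows the exact term $\rho I$ still keeps the linear system well posed.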


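Algorithm 2 Newton Sketch with self-concordant barriers

Input: Starting point $x^0$, constraint $C$, corresponding barrier function $g$ such that $f = f_0 + g$, tolerance $\delta > 0$,
line-search parameters $(\alpha, \beta)$, sketching matrices $S^t \in \mathbb{R}^{m \times n}$.

1: Compute approximate Newton step $\Delta x^t$ and approximate Newton decrement $\tilde\lambda_f(x^t)$:

$$\Delta x^t := \arg\min_{x^t + \Delta \in C} \langle \nabla f(x^t), \Delta \rangle + \frac{1}{2}\|S^t (\nabla^2 f_0(x^t))^{1/2} \Delta\|_2^2 + \frac{1}{2}\Delta^T \nabla^2 g(x^t) \Delta;$$

$$\tilde\lambda_f(x^t) := \nabla f(x^t)^T \Delta x^t.$$

2: Quit if $\tilde\lambda_f(x^t)^2/2 \le \delta$.

3: Line search: choose $\mu$: while $f(x^t + \mu\Delta x^t) > f(x^t) + \alpha\mu\tilde\lambda_f(x^t)$, $\mu \leftarrow \beta\mu$.

4: Update: $x^{t+1} = x^t + \mu\Delta x^t$.

Output: minimizer $x^t$, optimality gap $\tilde\lambda_f(x^t)$.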
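
The line search in Step 3 can be sketched in a few lines of Python. This is an illustration only: the function name `backtracking`, the default parameter values, and the cap `max_shrinks` are assumptions of the sketch, and `slope` stands for the inner product of the gradient with the step.

```python
def backtracking(f, x, dx, slope, alpha=0.1, beta=0.5, max_shrinks=50):
    """Shrink mu until f(x + mu*dx) <= f(x) + alpha*mu*slope.

    `slope` is the gradient inner product with dx and is negative for a
    descent direction. If f returns float('inf') outside the barrier's
    domain, infeasible trial points are rejected automatically.
    """
    mu = 1.0
    fx = f(x)
    for _ in range(max_shrinks):      # assumed safeguard against endless loops
        if f(x + mu * dx) <= fx + alpha * mu * slope:
            break
        mu *= beta
    return mu

# One-dimensional usage: f(x) = x^2 at x = 1 with an overshooting step dx = -4.
step = backtracking(lambda t: t * t, 1.0, -4.0, slope=2.0 * 1.0 * (-4.0))
```

In this usage example the full step and the half step are both rejected, and the quarter step, which lands exactly on the minimizer at zero, is accepted.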


The choice of sketch dimensions depends on the tangent cones defined by the iterates,
namely the sets

$$K^t := \{\Delta \in \mathbb{R}^d \mid x^t + \alpha\Delta \in C \text{ for some } \alpha > 0\}.$$

For a given sketch accuracy $\epsilon \in (0,1)$, we require that the sequence of sketch dimensions
satisfies the lower bound

$$m^t \ge \frac{c_3}{\epsilon^2} \max_{x \in C} W^2\big(\nabla^2 f(x)^{1/2} K^t\big). \qquad (19)$$


Finally, the reader should recall that the parameter $\nu$ was defined in equation (18), which depends
only on the sketching accuracy $\epsilon$ and the line search parameters. Given this set-up, we have
the following guarantee:

Theorem 3. Let $f : \mathbb{R}^d \to \mathbb{R}$ be a convex and self-concordant function, and let $g : \mathbb{R}^d \to \mathbb{R} \cup \{+\infty\}$
be a convex and self-concordant barrier for the convex set $C$. Suppose that we implement
Algorithm 2 with sketch dimensions $\{m^t\}_{t \ge 0}$ satisfying the lower bound (19). Then taking

$$N = \frac{f(x^0) - f(x^*)}{\nu} + 0.65 \log_2\Big(\frac{1}{16\delta}\Big) \quad \text{iterations}$$

suffices to obtain a $\delta$-approximate solution in function value with probability at least $1 - c_1 N e^{-c_2 m}$.


Thus, we see that the Newton Sketch method can also be used with self-concordant barrier
functions, which considerably extends its scope. Section 5.5 provides a numerical illustration
of its performance in this context. As we discuss in the next section, there is flexibility in
choosing the decomposition $f_0$ and $g$ corresponding to objective and barrier, which enables
us to also sketch the constraints.


4.3 Sketching with interior point methods 

In this section, we discuss the application of Newton Sketch to a form of barrier or interior
point methods. In particular we discuss two different strategies and provide rigorous worst-case
complexity results when the functions in the objective and constraints are self-concordant.

Algorithm 3 Interior point methods using Newton Sketch

Input: Strictly feasible starting point $x^0$, initial parameter $\tau^0$ s.t. $\tau := \tau^0 > 0$, $\mu > 1$, tolerance $\delta > 0$.

1: Centering step: Compute $\hat{x}(\tau)$ by Newton Sketch with backtracking line-search initialized at $x$
using Algorithm 1 or Algorithm 2.

2: Update $x := \hat{x}(\tau)$.

3: Quit if $r/\tau \le \delta$.

4: Increase $\tau$ by $\tau := \mu\tau$.

Output: minimizer $\hat{x}(\tau)$.


More precisely, let us consider a problem of the form

$$\min_{x \in \mathbb{R}^d} f_0(x) \quad \text{subject to } g_j(x) \le 0 \text{ for } j = 1, \dots, r, \qquad (20)$$

where $f_0$ and $\{g_j\}_{j=1}^r$ are twice-differentiable convex functions. We assume that there exists
a unique solution $x^*$ to the above problem.

The barrier method for computing $x^*$ is based on solving a sequence of problems of the
form

$$\hat{x}(\tau) := \arg\min_{x \in \mathbb{R}^d} \Big\{ \tau f_0(x) - \sum_{j=1}^r \log(-g_j(x)) \Big\}, \qquad (21)$$

for increasing values of the parameter $\tau \ge 1$. The family of solutions $\{\hat{x}(\tau)\}_{\tau \ge 1}$ traces out
what is known as the central path. A standard bound on the sub-optimality of $\hat{x}(\tau)$
is given by

$$f_0(\hat{x}(\tau)) - f_0(x^*) \le \frac{r}{\tau}.$$

The barrier method successively updates the penalty parameter $\tau$ and also the starting points
supplied to Newton's method using previous solutions.

Since Newton's method lies at the heart of the barrier method, we can obtain a fast version
by replacing the exact Newton minimization with the Newton sketch. Algorithm 3 provides
a precise description of this strategy. As noted in Step 1, there are two different strategies in
dealing with the convex constraints $g_j(x) \le 0$ for $j = 1, \dots, r$:

• Full sketch: Sketch the full Hessian of the objective function (21) using Algorithm 1.

• Partial sketch: Sketch only the Hessians corresponding to a subset of the functions
$\{f_0, g_j, j = 1, \dots, r\}$, and use exact Hessians for the other functions. Apply Algorithm 2.

As shown by our theory, either approach leads to the same convergence guarantees, but
the associated computational complexity can vary depending both on how data enters the
objective and constraints, as well as the Hessian structure arising from particular functions.
The following theorem is an application of the classical results on the barrier method tailored
for Newton Sketch using any of the above strategies. As before, the key parameter
$\nu$ is the one defined in equation (18).


Theorem 4 (Newton Sketch complexity for interior point methods). For a given target accuracy
$\delta \in (0,1)$ and any $\mu > 1$, the total number of Newton Sketch iterations required to
obtain a $\delta$-accurate solution using Algorithm 3 is at most

$$\frac{\log\big(r/(\tau^0 \delta)\big)}{\log \mu} \left( \frac{r(\mu - 1 - \log \mu)}{\nu} + 0.65 \log_2\Big(\frac{1}{16\delta}\Big) \right). \qquad (22)$$

If the parameter $\mu$ is set to minimize the above upper bound, the choice $\mu = 1 + \frac{1}{\sqrt{r}}$ yields $O(\sqrt{r})$
iterations. However, when applying the standard Newton method, this “optimal” choice is
typically not used in practice: instead, it is common to use a fixed value of $\mu \in [2, 100]$.
In experiments, experience suggests that the number of Newton iterations needed is a constant
independent of $r$ and other parameters. Theorem 4 allows us to obtain faster interior
point solvers with rigorous worst-case complexity results. We show different applications of
Algorithm 3 in the following section.
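
The leading factor in the bound (22) counts the outer (centering) stages of Algorithm 3. The short Python sketch below simulates the outer loop to count these stages; it is an illustration only, the function name `num_centering_steps` is hypothetical, and each centering step is treated as a black box.

```python
import math

def num_centering_steps(r, tau0, mu, delta):
    """Count centering steps until the stopping test r / tau <= delta holds."""
    tau, steps = tau0, 0
    while True:
        steps += 1                 # one centering step at the current tau
        if r / tau <= delta:       # stopping rule of Step 3
            return steps
        tau *= mu                  # Step 4: increase the barrier parameter

# Example: r = 100 constraints, tau0 = 1, mu = 10, delta = 2e-3.
stages = num_centering_steps(100, 1.0, 10.0, 2e-3)
closed_form = math.ceil(math.log(100 / (1.0 * 2e-3)) / math.log(10.0)) + 1
```

In the example the simulated count agrees with the closed form $\lceil \log(r/(\tau^0\delta)) / \log\mu \rceil + 1$, where the additional one accounts for the initial centering step. A larger $\mu$ reduces the number of stages but makes each centering problem harder, which is the trade-off discussed above.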


5 Applications and numerical results 

In this section, we discuss some applications of the Newton sketch to different optimization 
problems. In particular, we show various forms of Hessian structure that arise in applications, 
and how the Newton sketch can be computed. When the objective and/or the constraints 
contain more than one term, the barrier method with Newton Sketch has some flexibility in 
sketching. We discuss the choices of partial Hessian sketching strategy in the barrier method. 
It is also possible to apply the sketch in the primal or dual form, and we provide illustrations 
of both strategies here. 


5.1 Estimation in generalized linear models 


Recall the problem of (constrained) maximum likelihood estimation for a generalized linear
model, as introduced earlier. It leads to the family of optimization problems (10):
here $\psi : \mathbb{R} \to \mathbb{R}$ is a given convex function arising from the probabilistic model, and
$C \subseteq \mathbb{R}^d$ is a closed convex set that is used to enforce a certain type of structure in the solution.
Popular choices of such constraints include $\ell_1$-balls (for enforcing sparsity in a vector), nuclear
norms (for enforcing low-rank structure in a matrix), and other non-differentiable semi-norms
based on total variation (e.g., $\sum_{j=1}^{d-1} |x_{j+1} - x_j|$), useful for enforcing smoothness or clustering
constraints.


Suppose that we apply the Newton sketch algorithm to the optimization problem (10).
Given the current iterate $x^t$, computing the next iterate $x^{t+1}$ requires solving the constrained
quadratic program

$$\min_{x \in C} \Big\{ \frac{1}{2}\big\|S\,\mathrm{diag}\big(\psi''(\langle a_i, x^t\rangle, y_i)\big)^{1/2} A(x - x^t)\big\|_2^2 + \sum_{i=1}^n \psi'(\langle a_i, x^t\rangle, y_i)\,\langle a_i, x\rangle \Big\}. \qquad (23)$$

When the constraint $C$ is a scaled version of the $\ell_1$-ball, that is, $C = \{x \in \mathbb{R}^d \mid \|x\|_1 \le R\}$
for some radius $R > 0$, the convex program (23) is an instance of the Lasso program,

for which there is a very large body of work. For small values of R, where the cardinality of 
the solution x is very small, an effective strategy is to apply a homotopy type algorithm, also 
known as LARS 13 E], which solves the optimality conditions starting from R = 0. For other 
sets C, another popular choice is projected gradient descent, which is efficient when projection 
onto C is computationally simple. 

Focusing on the $\ell_1$-constrained case, let us consider the problem of choosing a suitable
sketch dimension $m$. Our choice involves the $\ell_1$-restricted minimal eigenvalue of the data
matrix $A$, which is given by

$$\gamma_s^-(A) := \min_{\|z\|_2 = 1,\ \|z\|_1 \le 2\sqrt{s}} \|Az\|_2^2. \qquad (24)$$

Note that we are always guaranteed that $\gamma_s^-(A) \ge \lambda_{\min}(A^T A)$. It also involves certain
quantities that depend on the function $\psi$, namely

$$\psi''_{\min} := \min_{x \in C}\,\min_{i=1,\dots,n} \psi''(\langle a_i, x\rangle, y_i), \quad \text{and} \quad \psi''_{\max} := \max_{x \in C}\,\max_{i=1,\dots,n} \psi''(\langle a_i, x\rangle, y_i),$$

where $a_i \in \mathbb{R}^d$ is the $i$th row of $A$. With this set-up, supposing that the optimal solution $x^*$
has cardinality at most $\|x^*\|_0 \le s$, then it can be shown (see the appendix) that
it suffices to take a sketch size

$$m = c_0\, \frac{\psi''_{\max}}{\psi''_{\min}}\, \frac{\max_{j=1,\dots,d} \|A_j\|_2^2}{\gamma_s^-(A)}\, s \log d, \qquad (25)$$

where $c_0$ is a universal constant. Let us consider some examples to illustrate:

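• Least-squares regression: $\psi(u) = \frac{1}{2}u^2$, $\psi''(u) = 1$ and $\psi''_{\min} = \psi''_{\max} = 1$.

• Poisson regression: $\psi(u) = e^u$, $\psi''(u) = e^u$, and the ratio $\psi''_{\max}/\psi''_{\min}$ is governed by $A_{\max}$ and $A_{\min}$.

• Logistic regression: $\psi(u) = \log(1 + e^u)$, $\psi''(u) = \frac{e^u}{(e^u + 1)^2}$, and the ratio $\psi''_{\max}/\psi''_{\min}$ is again governed by $A_{\max}$ and $A_{\min}$,

where $A_{\max} := \max_{i=1,\dots,n} \|a_i\|_\infty$ and $A_{\min} := \min_{i=1,\dots,n} \|a_i\|_\infty$.
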


For typical distributions of the data matrices, the sketch size choice given in equation (25)
is $O(s \log d)$. As an example, consider data matrices $A \in \mathbb{R}^{n \times d}$ where each row is
independently sampled from a sub-Gaussian distribution with variance 1. Then standard results on
random matrices [24] show that $\gamma_s^-(A) \ge 1/2$ as long as $n \ge c_1 s \log d$ for a sufficiently large
constant $c_1$. In addition, we have $\max_{j=1,\dots,d} \|A_j\|_2^2 = O(n)$, as well as $\psi''_{\max}/\psi''_{\min} = O(\log(n))$. For such
problems, the per iteration complexity of the Newton Sketch update scales as $O(s^2 d \log^2(d))$ using
standard Lasso solvers or as $O(sd \log(d))$ using projected gradient descent. Both
of these scalings are substantially smaller than those of conventional algorithms that fail to exploit
the small intrinsic dimension of the tangent cone.


5.2 Semidefinite programs 

The Newton sketch can also be applied to semidefinite programs. As one illustration, let
us consider the metric learning problem studied in machine learning. We are given feature vectors
$a_1, \dots, a_n \in \mathbb{R}^d$ and corresponding indicators $y_{ij} \in \{-1, +1\}$, where $y_{ij} = +1$ if $a_i$ and $a_j$
belong to the same label and $y_{ij} = -1$ otherwise, for all $i \ne j$ and $1 \le i, j \le n$. The task is
to learn a positive semidefinite matrix $X$ which represents a metric such that the semi-norm
$\|a\|_X := \sqrt{\langle a, Xa \rangle}$ establishes a nearness measure depending on class label. Using $\ell_2$-loss,
the optimization can be stated as the following semidefinite program (SDP):

$$\min_{X \succeq 0} \Big\{ \sum_{i \ne j}^{\binom{n}{2}} \big(\langle X, (a_i - a_j)(a_i - a_j)^T \rangle - y_{ij}\big)^2 + \lambda\,\mathrm{trace}(X) \Big\}.$$

Here the term $\mathrm{trace}(X)$, along with its multiplicative pre-factor $\lambda > 0$ that can be adjusted
by the user, is a regularization term for encouraging a relatively low-rank solution. Using
the standard self-concordant barrier $X \mapsto -\log\det(X)$ for the PSD cone, the barrier method
involves solving a sequence of sub-problems of the form

$$\min_{X \in \mathbb{R}^{d \times d}} \underbrace{\Big\{ \tau \sum_{i \ne j}^{\binom{n}{2}} \big(\langle X, (a_i - a_j)(a_i - a_j)^T \rangle - y_{ij}\big)^2 + \tau\lambda\,\mathrm{trace}\,X - \log\det(X) \Big\}}_{f(\mathrm{vec}(X))}.$$

Now the Hessian of the function $\mathrm{vec}(X) \mapsto f(\mathrm{vec}(X))$ is a $d^2 \times d^2$ matrix given by

$$\nabla^2 f(\mathrm{vec}(X)) = 2\tau \sum_{i \ne j}^{\binom{n}{2}} \mathrm{vec}(A_{ij})\,\mathrm{vec}(A_{ij})^T + X^{-1} \otimes X^{-1},$$

where $A_{ij} := (a_i - a_j)(a_i - a_j)^T$. Then we can apply the barrier method with partial Hessian
sketch on the first term, $\{S_{ij}\,\mathrm{vec}(A_{ij})\}_{i \ne j}$, and exact Hessian for the second term. Since the
vectorized decision variable is $\mathrm{vec}(X) \in \mathbb{R}^{d^2}$, the complexity of Newton Sketch is $O(m^2 d^2)$
while the complexity of a classical SDP interior-point solver is $O(n d^4)$.

5.3 Portfolio optimization and SVMs 

Here we consider the Markowitz formulation of the portfolio optimization problem. The
objective is to find $x \in \mathbb{R}^d$ belonging to the unit simplex, which corresponds to non-negative
weights associated with each of $d$ possible assets, so as to maximize the expected return minus
a coefficient times the variance of the return. We let $\mu \in \mathbb{R}^d$ denote a vector corresponding to
the mean return of the assets, and we let $\Sigma \in \mathbb{R}^{d \times d}$ be a symmetric, positive semidefinite matrix,
the covariance of the returns. The optimization problem is given by

$$\max_{x \ge 0,\ \sum_{j=1}^d x_j \le 1} \big\{ \langle \mu, x \rangle - \lambda x^T \Sigma x \big\}. \qquad (26)$$

The covariance of returns is often estimated from past stock data via the empirical covariance,
$\Sigma = A^T A$, where the columns of $A$ are time series corresponding to assets normalized by $\sqrt{n}$,
where $n$ is the length of the observation window.

The barrier method can be used to solve the above problem by solving penalized problems
of the form

$$\min_{x \in \mathbb{R}^d} \underbrace{\Big\{ -\tau \mu^T x + \tau\lambda x^T A^T A x - \sum_{i=1}^d \log(\langle e_i, x \rangle) - \log(1 - \langle \mathbf{1}, x \rangle) \Big\}}_{f(x)},$$

where $e_i \in \mathbb{R}^d$ is the $i$th element of the canonical basis and $\mathbf{1}$ is the vector of all ones. Then
the Hessian of the above barrier penalized formulation can be written as

$$\nabla^2 f(x) = 2\tau\lambda A^T A + \big(\mathrm{diag}\{x_i^2\}_{i=1}^d\big)^{-1} + \frac{1}{(1 - \langle \mathbf{1}, x \rangle)^2}\,\mathbf{1}\mathbf{1}^T.$$

Consequently we can sketch the data dependent part of the Hessian via $2\tau\lambda (SA)^T SA$, which has
rank at most $m$, and keep the remaining terms in the Hessian exact. Since the matrix $\mathbf{1}\mathbf{1}^T$ is rank
one, the resulting sketched estimate is therefore diagonal plus rank $(m + 1)$, where the matrix
inversion lemma can be applied for efficient computation of the Newton Sketch update (see
e.g. [8]). Therefore, as long as $m \le d$, the complexity per iteration scales as $O(md^2)$, which is
cheaper than the $O(nd^2)$ per step complexity associated with classical interior point methods.
We also note that support vector machine classification problems with squared hinge loss
have the same form as in (26), so the same strategy can be applied.

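The matrix inversion lemma step for a diagonal plus low-rank system can be sketched as follows. This is an illustrative Python sketch under our own conventions: the system is written as $(D + U^T U)v = \text{rhs}$ with $D$ a positive diagonal and $U$ a $k \times d$ matrix collecting the low-rank factors, and the name `solve_diag_plus_lowrank` is hypothetical.

```python
import numpy as np

def solve_diag_plus_lowrank(diag, U, rhs):
    """Solve (D + U^T U) v = rhs with D = diag(diag) > 0 and U of shape k x d.

    Uses the identity
      (D + U^T U)^{-1} = D^{-1} - D^{-1} U^T (I + U D^{-1} U^T)^{-1} U D^{-1},
    so that only a k x k linear system is solved.
    """
    Dinv_rhs = rhs / diag                      # D^{-1} rhs
    Dinv_Ut = U.T / diag[:, None]              # D^{-1} U^T, shape d x k
    K = np.eye(U.shape[0]) + U @ Dinv_Ut       # k x k capacitance matrix
    return Dinv_rhs - Dinv_Ut @ np.linalg.solve(K, U @ Dinv_rhs)
```

For the portfolio problem, one possible choice (again an assumption of this sketch) is to stack the scaled sketched data matrix and the scaled all-ones row into `U`, with `diag` holding the entries $1/x_i^2$.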
5.4 Unconstrained logistic regression with $d \ll n$

Let us now turn to some numerical comparisons of the Newton Sketch with other popular
optimization methods for large-scale instances of logistic regression. More specifically, we
generated a feature matrix $A \in \mathbb{R}^{n \times d}$ based on $d = 100$ features and $n = 16384$ observations.
Each row $a_i \in \mathbb{R}^d$ was generated from the $d$-variate Gaussian distribution $N(0, \Sigma)$ where
$\Sigma_{ij} = 2|0.99|^{i-j}$. As shown in Figure 3, the convergence of the algorithm per iteration is very
similar to Newton's method. Besides the original Newton's method, the other algorithms
compared are

• Gradient Descent (GD) with backtracking line search 

• Accelerated Gradient Descent (Acc. GD) adapted for strongly convex functions with 
manually tuned parameters. 

• Stochastic Gradient Descent (SGD) with the classical step size choice $1/\sqrt{t}$

• Broyden-Fletcher-Goldfarb-Shanno algorithm (BFGS) approximating the Hessian with 
gradients. 

For each problem, we averaged the performance of the randomized algorithms (Newton sketch
and SGD) over 10 independent trials. We ran the Newton sketch algorithm with sketch size
$m = 64$. To be fair in comparisons, we performed hand-tuning of the stepsize parameters in
the gradient-based methods so as to optimize their performance. The top panel in Figure 3
plots the log optimality gap versus the number of iterations: as expected, on this scale, the
classical form of Newton's method is the fastest, whereas the SGD method is the slowest.
However, when the log optimality gap is plotted versus the wall-clock time in the bottom
panel, we now see that the Newton sketch is the fastest.

5.5 A dual example: Lasso with $d \gg n$

The regularized Lasso problem takes the form $\min_{x \in \mathbb{R}^d} \{\frac{1}{2}\|Ax - y\|_2^2 + \lambda\|x\|_1\}$, where $\lambda > 0$
is a user-specified regularization parameter. In this section, we consider efficient sketching
strategies for this class of problems in the regime $d \gg n$. In particular, let us consider the
corresponding dual program, given by

$$\max_{\|A^T w\|_\infty \le \lambda} \Big\{ -\frac{1}{2}\|y - w\|_2^2 \Big\}.$$

By construction, the number of constraints $d$ in the dual program is larger than the number
of optimization variables $n$. If we apply the barrier method to solve this dual formulation,
then we need to solve a sequence of problems of the form

$$\min_{w \in \mathbb{R}^n} \underbrace{\Big\{ \frac{\tau}{2}\|y - w\|_2^2 - \sum_{j=1}^d \log(\lambda - \langle A_j, w \rangle) - \sum_{j=1}^d \log(\lambda + \langle A_j, w \rangle) \Big\}}_{f(w)},$$

where $A_j \in \mathbb{R}^n$ denotes the $j$th column of $A$. The Hessian of the above barrier penalized
formulation can be written as

$$\nabla^2 f(w) = \tau I_n + A\,\mathrm{diag}\Big(\frac{1}{(\lambda - \langle A_j, w \rangle)^2}\Big) A^T + A\,\mathrm{diag}\Big(\frac{1}{(\lambda + \langle A_j, w \rangle)^2}\Big) A^T.$$

Figure 3. Newton Sketch algorithm outperforms other popular optimization methods. Plots of
the log optimality gap versus iteration number (top) and plots of the log optimality gap versus
wall-clock time (bottom). Newton Sketch empirically provides the best accuracy in smallest
wall-clock time, and does not require knowledge of problem-dependent quantities (such as
strong convexity and smoothness parameters).


Consequently we can keep the first term in the Hessian, $\tau I_n$, exact and apply partial sketching
to the Hessians of the last two terms via

$$S\,\mathrm{diag}\Big(\frac{1}{|\lambda - \langle A_j, w \rangle|} + \frac{1}{|\lambda + \langle A_j, w \rangle|}\Big) A^T.$$

Since the partially sketched Hessian is of the form $\tau I_n + VV^T$, where $V$ is rank at most
$m$, we can use the matrix inversion lemma for efficiently calculating Newton Sketch updates.
The complexity of the above strategy for $d > n$ is $O(dm^2)$, where $m$ is at most $d$, whereas
traditional interior point solvers are typically $O(dn^2)$ per iteration.
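
The form $\tau I_n + VV^T$ can be illustrated with a short Python sketch that builds both the exact dual Hessian and a partially sketched counterpart. The sketch is illustrative only: the function name `dual_lasso_hessians` is hypothetical, the sketch matrix `S` (of size $m \times d$) is supplied by the caller, and the diagonal weights $\sqrt{1/(\lambda - c_j)^2 + 1/(\lambda + c_j)^2}$ with $c_j = \langle A_j, w \rangle$ are an assumption chosen so that $VV^T$ reproduces the two barrier terms exactly when `S` is the identity.

```python
import numpy as np

def dual_lasso_hessians(A, w, lam, tau, S):
    """Return (exact Hessian, partially sketched Hessian) at a strictly feasible w.

    A is n x d, w has length n, and S is m x d. The term tau * I is kept exact;
    only the barrier part A diag(weights) A^T is sketched.
    """
    c = A.T @ w                                       # c_j = <A_j, w>
    weights = 1.0 / (lam - c) ** 2 + 1.0 / (lam + c) ** 2
    n = A.shape[0]
    H_exact = tau * np.eye(n) + (A * weights) @ A.T   # A diag(weights) A^T
    V = (A * np.sqrt(weights)) @ S.T                  # n x m low-rank factor
    H_sketch = tau * np.eye(n) + V @ V.T
    return H_exact, H_sketch
```

Because the exact term $\tau I_n$ is retained, the sketched Hessian stays positive definite for any choice of `S`, and a linear system in it can be solved through an $m \times m$ system via the matrix inversion lemma.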

In order to test this algorithm, we generated a feature matrix $A \in \mathbb{R}^{n \times d}$ with $d = 4096$
features and $n = 50$ observations. Each row $a_i \in \mathbb{R}^d$ was generated from the multivariate
Gaussian distribution $N(0, \Sigma)$ with $\Sigma_{ij} = 2 \cdot |0.99|^{i-j}$. For a given problem instance, we
ran 10 independent trials of the sketched barrier method, and compared the results to the
original barrier method. Figure 4 plots the duality gap versus iteration number (top
panel) and versus the wall-clock time (bottom panel) for the original barrier method (blue)
and sketched barrier method (red): although the sketched algorithm requires more iterations,
these iterations are cheaper, leading to a smaller wall-clock time. This point is reinforced by
Figure 5, where we plot the wall-clock time required to reach a duality gap of $10^{-6}$ versus the
number of features $n$ in problem families of increasing size. Note that the sketched barrier
method outperforms the original barrier method, with significantly less computation time for
obtaining similar accuracy.




Figure 4. Plots of the duality gap versus iteration number (top panel) and duality gap versus 
wall-clock time (bottom panel) for the original barrier method (blue) and sketched barrier 
method (red). The sketched interior point method is run 10 times independently yielding 
slightly different curves in red. While the sketched method requires more iterations, its overall 
wall-clock time is much smaller. 


6 Proofs 

We now turn to the proofs of our theorems, with more technical details deferred to the 
appendices. 


6.1 Proof of Theorem 1

Throughout this proof, we let $r \in S^{n-1}$ denote a fixed vector that is independent of the sketch
matrix $S^t$ and the current iterate $x^t$. We then define the following pair of random variables

$$Z_1(S; x) := \sup_{w \in \nabla^2 f(x)^{1/2} K \cap S^{n-1}} \langle w, (S^T S - I) r \rangle, \qquad Z_2(S; x) := \inf_{w \in \nabla^2 f(x)^{1/2} K \cap S^{n-1}} \|Sw\|_2^2.$$

These random variables are significant, because the core of our proof is based on establishing
that the error vector $\Delta^t = x^t - x^*$ satisfies the recursive bound

$$\|\Delta^{t+1}\|_2 \le \frac{6\beta Z_1}{\gamma Z_2}\|\Delta^t\|_2 + \frac{4L}{\gamma Z_2}\|\Delta^t\|_2^2, \qquad (27)$$

Figure 5. Plot of the wall-clock time in seconds for reaching a duality gap of $10^{-6}$ for the
standard and sketched interior point methods as $n$ increases (in log-scale). The sketched interior
point method has significantly lower computation time compared to the original method.


where $Z_1 := Z_1(S^t; x^t)$ and $Z_2 := Z_2(S^t; x^t)$. We then combine this recursion with the
following probabilistic guarantee on $Z_1$ and $Z_2$. For a given tolerance parameter $\epsilon$,
consider the “good event”

$$\mathcal{E}^t := \Big\{ Z_1 \le \frac{\epsilon}{2}, \text{ and } Z_2 \ge 1 - \epsilon \Big\}. \qquad (28)$$

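Lemma 1 (Sufficient conditions on sketch dimension [20]).

(a) For sub-Gaussian sketch matrices, given a sketch size $m > \frac{c_0}{\epsilon^2} \max_{x \in C} W^2(\nabla^2 f(x)^{1/2} K)$,
we have

$$\mathbb{P}[\mathcal{E}^t] \ge 1 - c_1 e^{-c_2 m \epsilon^2}. \qquad (29)$$

(b) For randomized orthogonal system (ROS) sketches over the class of self-bounding cones,
given a sketch size $m > \frac{c_0 \log^4 n}{\epsilon^2} \max_{x \in C} W^2(\nabla^2 f(x)^{1/2} K)$, we have

$$\mathbb{P}[\mathcal{E}^t] \ge 1 - c_1 e^{-c_2 \frac{m \epsilon^2}{\log^4 n}}. \qquad (30)$$

Combining Lemma 1 with the recursion (27) and re-scaling $\epsilon$ appropriately yields the claim
of the theorem.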

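Accordingly, it remains to prove the recursion (27), and we do so via a basic inequality
argument. Recall the function $x \mapsto \Phi(x; S^t)$ that underlies the sketched Newton update:
since $x^{t+1}$ and $x^*$ are optimal and feasible, respectively, for the constrained optimization problem, we have
$\Phi(x^{t+1}; S^t) \le \Phi(x^*; S^t)$. Introducing the error vector $\Delta^t := x^t - x^*$, some straightforward algebra
then leads to the basic inequality

$$\frac{1}{2}\|S^t \nabla^2 f(x^t)^{1/2} \Delta^{t+1}\|_2^2 \le \langle S^t \nabla^2 f(x^t)^{1/2} \Delta^{t+1},\, S^t \nabla^2 f(x^t)^{1/2} \Delta^t \rangle - \langle \nabla f(x^t) - \nabla f(x^*),\, \Delta^{t+1} \rangle. \qquad (31)$$

Let us first upper bound the right-hand side. By using the integral form of Taylor's expansion,
we have $\langle \nabla f(x^t) - \nabla f(x^*), \Delta^{t+1} \rangle = \int_0^1 \langle \nabla^2 f(x^t + u(x^* - x^t)) \Delta^t, \Delta^{t+1} \rangle\, du$, and hence

$$\mathrm{RHS} = \int_0^1 \Big\langle \big[ \nabla^2 f(x^t)^{1/2} (S^t)^T S^t \nabla^2 f(x^t)^{1/2} - \nabla^2 f(x^t + u(x^* - x^t)) \big] \Delta^t,\, \Delta^{t+1} \Big\rangle\, du.$$

By adding and subtracting terms and then applying triangle inequality, we have the bound
$\mathrm{RHS} \le T_1 + T_2$, where

$$T_1 := \Big| \Big\langle \big[ \nabla^2 f(x^t)^{1/2} \big( (S^t)^T S^t - I \big) \nabla^2 f(x^t)^{1/2} \big] \Delta^t,\, \Delta^{t+1} \Big\rangle \Big|, \quad \text{and}$$

$$T_2 := \int_0^1 \big\| \nabla^2 f(x^t + u(x^* - x^t)) - \nabla^2 f(x^t) \big\|_{op}\, du\; \|\Delta^t\|_2 \|\Delta^{t+1}\|_2.$$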

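Now observe that the vector $r := \frac{\nabla^2 f(x^t)^{1/2} \Delta^t}{\|\nabla^2 f(x^t)^{1/2} \Delta^t\|_2}$ is independent of the randomness in $S^t$,
whereas the vector $\nabla^2 f(x^t)^{1/2} \Delta^{t+1}$ belongs to the cone $\nabla^2 f(x^t)^{1/2} K$. Consequently, by the
definition of $Z_1$, we have

$$T_1 \le Z_1 \|\nabla^2 f(x^t)^{1/2} \Delta^t\|_2 \|\nabla^2 f(x^t)^{1/2} \Delta^{t+1}\|_2. \qquad (32)$$

Now note that using the fact that $\beta$ controls the smoothness of the gradient and the Lipschitz
continuity of the Hessian, we can upper bound the terms on the above right-hand side as follows:

$$\langle \Delta^t, \nabla^2 f(x^t) \Delta^t \rangle = \langle \Delta^t, \nabla^2 f(x^*) \Delta^t \rangle + \langle \Delta^t, (\nabla^2 f(x^t) - \nabla^2 f(x^*)) \Delta^t \rangle \le \{\beta + L\|\Delta^t\|_2\} \|\Delta^t\|_2^2,$$

and similarly, $\langle \Delta^{t+1}, \nabla^2 f(x^t) \Delta^{t+1} \rangle \le \{\beta + L\|\Delta^t\|_2\} \|\Delta^{t+1}\|_2^2$. Combining the above bounds
with (32), we obtain

$$T_1 \le Z_1 \{\beta + L\|\Delta^t\|_2\} \|\Delta^{t+1}\|_2 \|\Delta^t\|_2. \qquad (33)$$

On the other hand, by the $L$-Lipschitz condition on the Hessian, we have

$$T_2 \le L \|\Delta^t\|_2^2 \|\Delta^{t+1}\|_2.$$

Substituting these two bounds into our basic inequality, we have

$$\frac{1}{2}\|S^t \nabla^2 f(x^t)^{1/2} \Delta^{t+1}\|_2^2 \le Z_1 \{\beta + L\|\Delta^t\|_2\} \|\Delta^t\|_2 \|\Delta^{t+1}\|_2 + L \|\Delta^t\|_2^2 \|\Delta^{t+1}\|_2. \qquad (34)$$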

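Our final step is to lower bound the left-hand side (LHS) of this inequality. By definition of
$Z_2$, we have

$$\|S^t \nabla^2 f(x^t)^{1/2} \Delta^{t+1}\|_2^2 \ge Z_2 \langle \Delta^{t+1}, \nabla^2 f(x^t) \Delta^{t+1} \rangle$$
$$= Z_2 \big\{ \langle \Delta^{t+1}, \nabla^2 f(x^*) \Delta^{t+1} \rangle + \langle \Delta^{t+1}, (\nabla^2 f(x^t) - \nabla^2 f(x^*)) \Delta^{t+1} \rangle \big\}$$
$$\ge Z_2 \{\gamma - L\|\Delta^t\|_2\} \|\Delta^{t+1}\|_2^2.$$

Substituting this lower bound into the previous inequality (34) and then rearranging, we find
that, as long as $\|\Delta^t\|_2 \le \frac{\gamma}{2L}$, we have

$$\|\Delta^{t+1}\|_2 \le \frac{2 Z_1 (\beta + L\|\Delta^t\|_2)}{Z_2 (\gamma - L\|\Delta^t\|_2)} \|\Delta^t\|_2 + \frac{2L}{(\gamma - L\|\Delta^t\|_2) Z_2} \|\Delta^t\|_2^2 \le \frac{6\beta Z_1}{\gamma Z_2} \|\Delta^t\|_2 + \frac{4L}{\gamma Z_2} \|\Delta^t\|_2^2,$$

as claimed.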


6.2 Proof of Theorem 2

Recall that in this case, we assume that $f$ is a self-concordant strictly convex function. We
adopt the following notation and conventions from the book [18]. For a given point $x$, we
define the pair of dual norms

$$\|u\|_x := \langle \nabla^2 f(x) u, u \rangle^{1/2}, \quad \text{and} \quad \|v\|_x^* := \langle \nabla^2 f(x)^{-1} v, v \rangle^{1/2},$$

as well as the Newton decrement

$$\lambda_f(x) = \langle \nabla^2 f(x)^{-1} \nabla f(x), \nabla f(x) \rangle^{1/2} = \|\nabla^2 f(x)^{-1} \nabla f(x)\|_x = \|\nabla^2 f(x)^{-1/2} \nabla f(x)\|_2.$$

Note that $\nabla^2 f(x)^{-1}$ is well-defined for strictly convex self-concordant functions. In terms of
this notation, the exact Newton update is given by $x \mapsto x_{NE} := x + v_{NE}$, where

$$v_{NE} := \arg\min_{z \in C - x} \Big\{ \frac{1}{2}\|\nabla^2 f(x)^{1/2} z\|_2^2 + \langle z, \nabla f(x) \rangle \Big\}, \qquad (35)$$

whereas the Newton sketch update is given by $x \mapsto x_{NSK} := x + v_{NSK}$, where

$$v_{NSK} := \arg\min_{z \in C - x} \Big\{ \frac{1}{2}\|S \nabla^2 f(x)^{1/2} z\|_2^2 + \langle z, \nabla f(x) \rangle \Big\}. \qquad (36)$$

The proof of Theorem 2 given in this section involves the unconstrained case ($C = \mathbb{R}^d$), whereas
the proofs of later theorems involve the more general constrained case. In the unconstrained
case, the two updates take the simpler forms

$$x_{NE} = x - (\nabla^2 f(x))^{-1} \nabla f(x), \quad \text{and} \quad x_{NSK} = x - \big(\nabla^2 f(x)^{1/2} S^T S \nabla^2 f(x)^{1/2}\big)^{-1} \nabla f(x).$$


For a self-concordant function, the sub-optimality of the Newton iterate $x_{NE}$ in function
value satisfies the bound

$$f(x_{NE}) - \underbrace{\min_{x \in \mathbb{R}^d} f(x)}_{f(x^*)} \le [\lambda_f(x_{NE})]^2.$$

This classical bound is not directly applicable to the Newton sketch update, since it involves
the approximate Newton decrement $\tilde\lambda_f(x)^2 := -\langle \nabla f(x), v_{NSK} \rangle$, as opposed to the exact one
$\lambda_f(x)^2 := -\langle \nabla f(x), v_{NE} \rangle$. Thus, our strategy is to prove that with high probability over the
randomness in the sketch matrix, the approximate Newton decrement can be used as an exit
condition.
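
To see how close the two exit quantities are in practice, one can compute the inner products $-\langle \nabla f(x), v_{NE} \rangle$ and $-\langle \nabla f(x), v_{NSK} \rangle$ side by side in the unconstrained case. The Python sketch below does this; it is illustrative only, the function name `newton_decrements` is hypothetical, and the Hessian is assumed to be available through a square root `B` with `B.T @ B` equal to the Hessian.

```python
import numpy as np

def newton_decrements(B, grad, S):
    """Return (-<grad, v_NE>, -<grad, v_NSK>) in the unconstrained case.

    B is n x d with B.T @ B equal to the Hessian, and S is an m x n sketch.
    """
    v_ne = -np.linalg.solve(B.T @ B, grad)       # exact Newton direction
    SB = S @ B
    v_nsk = -np.linalg.solve(SB.T @ SB, grad)    # sketched Newton direction
    return float(-grad @ v_ne), float(-grad @ v_nsk)
```

Both returned values are nonnegative whenever the (sketched) Hessian is positive definite, and they coincide when `S` is the identity.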


Recall the definitions (35) and (36) of the exact $v_{NE}$ and sketched $v_{NSK}$ Newton update
directions, as well as the definition of the tangent cone $K$ at $x \in C$. Let $K^t$ be the tangent
cone at $x^t$. The following lemma provides a high probability bound on their difference:

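Lemma 2. Let $S \in \mathbb{R}^{m \times n}$ be a sub-Gaussian or ROS sketch matrix, and consider any fixed
vector $x \in C$ independent of the sketch matrix. If $m \ge \frac{c_0}{\epsilon^2} W(\nabla^2 f(x)^{1/2} K^t)^2$, then

$$\big\|\nabla^2 f(x)^{1/2} (v_{NSK} - v_{NE})\big\|_2 \le \epsilon \big\|\nabla^2 f(x)^{1/2} v_{NE}\big\|_2 \qquad (37)$$

with probability at least $1 - c_1 e^{-c_2 m \epsilon^2}$.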

Similar to the standard analysis of Newton's method, our analysis of the Newton sketch
algorithm is split into two phases defined by the magnitude of the decrement $\tilde\lambda_f(x)$. In
particular, the following lemma constitutes the core of our proof:

Lemma 3. For $\epsilon \in (0, 1/2)$, there exist constants $\nu > 0$ and $\eta \in (0, 1/16)$ such that:

(a) If $\tilde\lambda_f(x) > \eta$, then $f(x_{NSK}) - f(x) \le -\nu$ with probability at least $1 - c_1 e^{-c_2 m}$.

(b) Conversely, if $\tilde\lambda_f(x) \le \eta$, then

$$\tilde\lambda_f(x_{NSK}) \le \tilde\lambda_f(x), \quad \text{and} \qquad (38a)$$

$$\lambda_f(x_{NSK}) \le \Big(\frac{16}{25}\Big) \lambda_f(x), \qquad (38b)$$

where both bounds hold with probability $1 - c_1 e^{-c_2 m}$.

Using this lemma, let us now complete the proof of the theorem, dividing our analysis into 
the two phases of the algorithm. 


First phase analysis: By Lemma 3(a), each iteration in the first phase decreases the function
value by at least $\nu > 0$, so the number of first phase iterations $N_1$ is at most

$$N_1 := \frac{f(x^0) - f(x^*)}{\nu},$$

with probability at least $1 - N_1 c_1 e^{-c_2 m}$.

Second phase analysis: Next, let us suppose that at some iteration $t$, the condition
$\tilde\lambda_f(x^t) \le \eta$ holds, so that part (b) of Lemma 3 can be applied. In fact, the bound (38a)
then guarantees that $\tilde\lambda_f(x^{t+1}) \le \eta$, so that we may apply the contraction bound (38b)
repeatedly for $N_2$ rounds so as to obtain that

$$\lambda_f(x^{t+N_2}) \le \Big(\frac{16}{25}\Big)^{N_2} \lambda_f(x^t)$$

with probability $1 - N_2 c_1 e^{-c_2 m}$.

Since $\lambda_f(x^t) \le \eta \le 1/16$ by assumption, the self-concordance of $f$ then implies that

$$f(x^{t+N_2}) - f(x^*) \le \Big(\frac{16}{25}\Big)^{N_2} \frac{1}{16}.$$

Therefore, in order to achieve $f(x^{t+N_2}) - f(x^*) \le \epsilon$, it suffices to have
the number of second phase iterations lower bounded as $N_2 \ge 0.65 \log_2\big(\frac{1}{16\epsilon}\big)$.

Putting together the two phases, we conclude that the total number of iterations $N$ required
to achieve $\epsilon$-accuracy is at most

$$N = N_1 + N_2 \le \frac{f(x^0) - f(x^*)}{\nu} + 0.65 \log_2\Big(\frac{1}{16\epsilon}\Big),$$

and moreover, this guarantee holds with probability at least $1 - c_1 N e^{-c_2 m}$.

The final step in our proof of the theorem is to establish Lemma 3, and we do so in the next
two subsections.

6.2.1 Proof of Lemma 3(a)

Our proof of this part is performed conditionally on the event $\mathcal{D} := \{\tilde\lambda_f(x) > \eta\}$. Our strategy
is to show that the backtracking line search leads to a stepsize $s > 0$ such that the function
decrement in moving from the current iterate $x$ to the new sketched iterate $x_{NSK} = x + s v_{NSK}$
is at least

$$f(x_{NSK}) - f(x) \le -\nu \quad \text{with probability at least } 1 - c_1 e^{-c_2 m}. \qquad (39)$$

The outline of our proof is as follows. Defining the univariate function $g(u) := f(x + u v_{NSK})$
and $\epsilon' := \frac{2\epsilon}{1 - \epsilon}$, we first show that $\hat{u} = \frac{1}{1 + (1 + \epsilon') \tilde\lambda_f(x)}$ satisfies the bound

$$g(\hat{u}) \le g(0) - a \hat{u} \tilde\lambda_f(x)^2, \qquad (40a)$$

which implies that $\hat{u}$ satisfies the exit condition of backtracking line search. Therefore, the
stepsize $s$ must be lower bounded as $s \ge b\hat{u}$, which then implies that the updated solution
$x_{NSK} = x + s v_{NSK}$ satisfies the decrement bound

$$f(x_{NSK}) - f(x) \le -ab \frac{\tilde\lambda_f(x)^2}{1 + (1 + \epsilon') \tilde\lambda_f(x)}. \qquad (40b)$$

Since $\tilde\lambda_f(x) > \eta$ by assumption and the function $u \mapsto \frac{u^2}{1 + (1 + \epsilon') u}$ is monotone increasing, this
bound implies that inequality (39) holds with $\nu = ab \frac{\eta^2}{1 + (1 + \epsilon') \eta}$.

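It remains to prove the claims (40a) and (40b), for which we make use of the following
auxiliary lemmas:

Lemma 4. For $u \in \mathrm{dom}\, g \cap \mathbb{R}^+$, we have the decrement bound

$$g(u) \le g(0) + u \langle \nabla f(x), v_{NSK} \rangle - u \|[\nabla^2 f(x)]^{1/2} v_{NSK}\|_2 - \log\big(1 - u \|[\nabla^2 f(x)]^{1/2} v_{NSK}\|_2\big), \qquad (41)$$

provided that $u \|[\nabla^2 f(x)]^{1/2} v_{NSK}\|_2 < 1$.

Lemma 5. With probability at least $1 - c_1 e^{-c_2 m}$, we have

$$\|[\nabla^2 f(x)]^{1/2} v_{NSK}\|_2^2 \le \Big(\frac{1 + \epsilon}{1 - \epsilon}\Big)^2 [\tilde\lambda_f(x)]^2. \qquad (42)$$
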
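The proofs of these lemmas are provided in Appendices A.2 and A.3. Using them, let us
prove the claims (40a) and (40b). Recalling our shorthand $\epsilon' := \frac{1 + \epsilon}{1 - \epsilon} - 1 = \frac{2\epsilon}{1 - \epsilon}$, substituting
inequality (42) into the decrement formula (41) yields

$$g(u) \le g(0) - u \tilde\lambda_f(x)^2 - u(1 + \epsilon') \tilde\lambda_f(x) - \log\big(1 - u(1 + \epsilon') \tilde\lambda_f(x)\big) \qquad (43)$$

$$= g(0) - \big\{ u(1 + \epsilon')^2 \tilde\lambda_f(x)^2 + u(1 + \epsilon') \tilde\lambda_f(x) + \log\big(1 - u(1 + \epsilon') \tilde\lambda_f(x)\big) \big\} + u\big((1 + \epsilon')^2 - 1\big) \tilde\lambda_f(x)^2,$$

where we added and subtracted $u(1 + \epsilon')^2 \tilde\lambda_f(x)^2$ so as to obtain the final equality.

We now prove inequality (40a). Setting $u = \hat{u} := \frac{1}{1 + (1 + \epsilon') \tilde\lambda_f(x)}$, which satisfies the
conditions of Lemma 4, yields

$$g(\hat{u}) \le g(0) - (1 + \epsilon') \tilde\lambda_f(x) + \log\big(1 + (1 + \epsilon') \tilde\lambda_f(x)\big) + \frac{(\epsilon'^2 + 2\epsilon') \tilde\lambda_f(x)^2}{1 + (1 + \epsilon') \tilde\lambda_f(x)}.$$

Making use of the standard inequality $-u + \log(1 + u) \le -\frac{\frac{1}{2} u^2}{1 + u}$ (for instance, see the book [3]),
we find that

$$g(\hat{u}) \le g(0) - \frac{\frac{1}{2}(1 + \epsilon')^2 \tilde\lambda_f(x)^2}{1 + (1 + \epsilon') \tilde\lambda_f(x)} + \frac{(\epsilon'^2 + 2\epsilon') \tilde\lambda_f(x)^2}{1 + (1 + \epsilon') \tilde\lambda_f(x)}$$

$$= g(0) - \Big(\frac{1}{2} - \frac{1}{2}\epsilon'^2 - \epsilon'\Big) \tilde\lambda_f(x)^2 \hat{u}$$

$$\le g(0) - a \tilde\lambda_f(x)^2 \hat{u},$$

where the final inequality follows from our assumption $a \le \frac{1}{2} - \frac{1}{2}\epsilon'^2 - \epsilon'$. This completes the
proof of the bound (40a). Finally, the lower bound (40b) follows by setting $u = b\hat{u}$ into the
decrement inequality (41).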

6.2.2 Proof of Lemma 3(b)

The proof of this part hinges on the following auxiliary lemma:

Lemma 6. For all $\epsilon \in (0, 1/2)$, we have

$$\lambda_f(x_{NSK}) \le \frac{(1 + \epsilon) \lambda_f(x)^2 + \epsilon \lambda_f(x)}{\big(1 - (1 + \epsilon) \lambda_f(x)\big)^2}, \quad \text{and} \qquad (44a)$$

$$(1 - \epsilon) \lambda_f(x_{NSK}) \le \tilde\lambda_f(x_{NSK}) \le (1 + \epsilon) \lambda_f(x_{NSK}), \qquad (44b)$$

where all bounds hold with probability at least $1 - c_1 e^{-c_2 m}$.

See Appendix A.4 for the proof.

We now use Lemma 6 to prove the two claims in the lemma statement.

Proof of the bound (38a): Recall from the theorem statement the definition of $\eta$.
By examining the roots of a polynomial in $\epsilon$, it can be seen that $\eta \le \frac{1 - \epsilon}{1 + \epsilon} \frac{1}{16}$.
By applying the inequalities (44b), we have

$$(1 + \epsilon) \lambda_f(x) \le \frac{1 + \epsilon}{1 - \epsilon} \tilde\lambda_f(x) \le \frac{1 + \epsilon}{1 - \epsilon} \eta \le \frac{1}{16}, \qquad (45)$$

whence inequality (44a) implies that

$$\lambda_f(x_{NSK}) \le \frac{\frac{1}{16} \lambda_f(x) + \epsilon \lambda_f(x)}{(1 - \frac{1}{16})^2} = \Big(\frac{16}{225} + \frac{256}{225} \epsilon\Big) \lambda_f(x) \le \frac{16}{25} \lambda_f(x). \qquad (46)$$

Here the final inequality holds for all $\epsilon \in (0, 1/2)$. Combining the bound (44b) with
inequality (46) yields

$$\tilde\lambda_f(x_{NSK}) \le (1 + \epsilon) \lambda_f(x_{NSK}) \le (1 + \epsilon) \Big(\frac{16}{25}\Big) \lambda_f(x) \le \tilde\lambda_f(x),$$

where the final inequality again uses the condition $\epsilon \in (0, \frac{1}{2})$. This completes the proof of the
bound (38a).

Proof of the bound (38b): This inequality has been established as a consequence of proving
the bound (46).


6.3 Proof of Theorem |3] 

Given the proof of Theorem it remains only to prove the following modified version of 
Lemma It applies to the exact and sketched Newton directions Unej'^nsk £ that are 
defined as follows 

Une : = arg nun f{x)^^^z\\l + {z, Vf{x)) + \{z, V'^g{x)z)\, (47a) 

z£.C—x K Z A ) 

Unsk = arg nun | l\\SV^z\\l + {z, Vf{x)) + ]-{z, V^g{x)z) |. (47b) 

zGC—x 6 Z Z J 

' -V-' 

^(z;5) 


Thus, the only difference is that the Hessian V^/(x) is sketched, whereas the term \7‘^g{x) 
remains unsketched. 


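Lemma 7. Let $S \in \mathbb{R}^{m \times n}$ be a sub-Gaussian or ROS sketching matrix, and let $x \in \mathbb{R}^d$ be a (possibly random) vector independent of $S$. If $m \ge c_0 \max_{x \in \mathcal{C}} \frac{\mathcal{W}(\nabla^2 f(x)^{1/2}\mathcal{K})^2}{\epsilon^2}$, then

$$\big\|\nabla^2 f(x)^{1/2}(v_{\mathrm{NSK}} - v_{\mathrm{NE}})\big\|_2 \le \epsilon\,\big\|\nabla^2 f(x)^{1/2} v_{\mathrm{NE}}\big\|_2 \tag{48}$$

with probability at least $1 - c_1 e^{-c_2 m \epsilon^2}$.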
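
To make the partially sketched update (47a)-(47b) concrete, here is a minimal sketch for the unconstrained case $\mathcal{C} = \mathbb{R}^d$, where both directions reduce to linear solves. The function names, the Gaussian sketch with $\mathbb{E}[S^T S] = I_n$, and the convention that `hess_sqrt` is an $n \times d$ matrix $B$ with $B^T B = \nabla^2 f(x)$ are illustrative assumptions, not part of the analysis above.

```python
import numpy as np


def newton_directions(hess_sqrt, grad_f, hess_g, m, rng):
    """Exact and partially sketched directions of (47a)-(47b) when C = R^d.

    hess_sqrt: (n, d) matrix B with B.T @ B equal to the Hessian of f at x
    grad_f:    (d,) gradient of f at x
    hess_g:    (d, d) Hessian of g at x, which is kept unsketched
    """
    n, _ = hess_sqrt.shape
    # Gaussian sketch scaled so that E[S.T @ S] = I_n (one sub-Gaussian choice).
    S = rng.standard_normal((m, n)) / np.sqrt(m)
    SB = S @ hess_sqrt
    # Setting the gradient of each quadratic objective to zero gives a linear system.
    v_ne = np.linalg.solve(hess_sqrt.T @ hess_sqrt + hess_g, -grad_f)
    v_nsk = np.linalg.solve(SB.T @ SB + hess_g, -grad_f)
    return v_ne, v_nsk


def relative_error(hess_sqrt, v_ne, v_nsk):
    """Ratio ||B (v_nsk - v_ne)||_2 / ||B v_ne||_2, the quantity bounded in (48)."""
    num = np.linalg.norm(hess_sqrt @ (v_nsk - v_ne))
    return num / np.linalg.norm(hess_sqrt @ v_ne)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    B = rng.standard_normal((2000, 10))
    grad = rng.standard_normal(10)
    v_ne, v_nsk = newton_directions(B, grad, np.eye(10), m=1000, rng=rng)
    print(relative_error(B, v_ne, v_nsk))
```

Only the term involving $\nabla^2 f(x)$ passes through $S$; the matrix `hess_g` enters both systems unchanged, mirroring the remark after (47b).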
7 Discussion


In this paper, we introduced and analyzed the Newton sketch, a randomized approximation
to the classical Newton updates. This algorithm is a natural generalization of the Iterative
Hessian Sketch (IHS) updates analyzed in our earlier work [19]. The IHS applies only
to constrained least-squares problems (for which the Hessian is independent of the iteration
number), whereas the Newton sketch applies to any twice differentiable function subject
to a closed convex constraint set. We described various applications of the Newton sketch,
including its use with barrier methods to solve various forms of constrained problems. For
the minimization of self-concordant functions, the combination of the Newton sketch within
interior point updates leads to much faster algorithms for an extensive body of convex
optimization problems.

Each iteration of the Newton sketch always has lower computational complexity than
classical Newton’s method. Moreover, it has lower computational complexity than first-order
methods when either $n \ge d^2$ or $d \ge n^2$ (using the dual strategy); here $n$ and $d$ denote the
dimensions of the data matrix $A$. In the context of barrier methods, the parameters $n$ and $d$
typically correspond to the number of constraints and number of variables, respectively. In
many “big data” problems, one of the dimensions is much larger than the other, in which case
the Newton sketch is advantageous. Moreover, sketches based on the randomized Hadamard
transform are well-suited to parallel environments: in this case, the sketching step can be
done in $O(\log m)$ time with $O(nd)$ processors. This scheme significantly decreases the amount
of central computation, namely from $O(m^2 d + nd \log m)$ to $O(m^2 d + \log d)$.
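
For readers who want to experiment with Hadamard-based sketches, the following is a small sequential sketch of one common ROS construction, $S = \sqrt{n/m}\, P H D$, with random signs $D$, an orthonormal Hadamard matrix $H$, and uniform row sampling $P$. The normalization, the sampling with replacement, and the names `fwht` and `ros_sketch` are assumptions made for illustration; the parallel implementation discussed above is not reproduced here.

```python
import numpy as np


def fwht(a):
    """Unnormalized fast Walsh-Hadamard transform along axis 0.

    The number of rows must be a power of 2; the cost is O(n log n) per column.
    """
    a = np.array(a, dtype=float)
    n = a.shape[0]
    if n == 0 or n & (n - 1):
        raise ValueError("number of rows must be a power of 2")
    h = 1
    while h < n:
        for start in range(0, n, 2 * h):
            top = a[start:start + h].copy()
            bottom = a[start + h:start + 2 * h].copy()
            a[start:start + h] = top + bottom          # butterfly: sum
            a[start + h:start + 2 * h] = top - bottom  # butterfly: difference
        h *= 2
    return a


def ros_sketch(A, m, rng):
    """Return S @ A for S = sqrt(n/m) * P H D, so that E[S.T @ S] = I_n."""
    n = A.shape[0]
    signs = rng.choice([-1.0, 1.0], size=n)          # diagonal of D
    HDA = fwht(signs[:, None] * A) / np.sqrt(n)      # orthonormal Hadamard applied to D A
    rows = rng.integers(0, n, size=m)                # rows chosen uniformly with replacement
    return np.sqrt(n / m) * HDA[rows]
```

The sketch $SA$ is formed without ever building $S$ or $H$ explicitly, which is what makes the $O(nd \log n)$ sequential cost possible.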

There are a number of open problems associated with the Newton sketch. Here we focused
our analysis on the cases of sub-Gaussian and randomized orthogonal system (ROS) sketches.
It would also be interesting to analyze sketches based on coordinate sampling, or other forms of
“sparse” sketches (for instance, see the paper [10]). Such techniques might lead to significant
gains in cases where the data matrix $A$ is itself sparse: more specifically, it may be possible
to obtain sketched optimization algorithms whose computational complexity scales only with
the number of nonzero entries in the data matrices, rather than the full dimensionality $nd$.
Finally, it would be interesting to explore the problem of lower bounds on the sketch
dimension $m$. In particular, is there a threshold below which any algorithm that has access
only to gradients and $m$-sketched Hessians must necessarily converge at a sub-linear rate, or
in a way that depends on the strong convexity and smoothness parameters? Such a result
would clarify whether or not the guarantees in this paper are improvable.
A Technical results for Theorem 2

In this appendix, we collect together various technical results and proofs that are required in
the proof of Theorem 2.
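A.1 Proof of Lemma

Let $u$ be a unit-norm vector independent of $S$, and consider the random quantities

$$Z_1(S, x) = \inf_{v \in \nabla^2 f(x)^{1/2}\mathcal{K} \cap \mathcal{S}^{n-1}} \|Sv\|_2^2 \quad \text{and} \tag{49a}$$

$$Z_2(S, x) = \sup_{v \in \nabla^2 f(x)^{1/2}\mathcal{K} \cap \mathcal{S}^{n-1}} \big|\langle u, (S^T S - I_n)\, v\rangle\big|. \tag{49b}$$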




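By the optimality and feasibility of $v_{\mathrm{NSK}}$ and $v_{\mathrm{NE}}$ (respectively) for the sketched Newton
update (36), we have

$$\tfrac{1}{2}\|S\nabla^2 f(x)^{1/2} v_{\mathrm{NSK}}\|_2^2 - \langle v_{\mathrm{NSK}}, \nabla f(x)\rangle \le \tfrac{1}{2}\|S\nabla^2 f(x)^{1/2} v_{\mathrm{NE}}\|_2^2 - \langle v_{\mathrm{NE}}, \nabla f(x)\rangle.$$

Defining the difference vector $e := v_{\mathrm{NSK}} - v_{\mathrm{NE}}$, some algebra leads to the basic inequality

$$\tfrac{1}{2}\|S\nabla^2 f(x)^{1/2} e\|_2^2 \le -\big\langle \nabla^2 f(x)^{1/2} v_{\mathrm{NE}},\, S^T S\,\nabla^2 f(x)^{1/2} e\big\rangle + \langle e, \nabla f(x)\rangle. \tag{50}$$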


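Moreover, by the optimality and feasibility of $v_{\mathrm{NE}}$ and $v_{\mathrm{NSK}}$ for the exact Newton update (35),
we have

$$\langle \nabla^2 f(x) v_{\mathrm{NE}} - \nabla f(x),\, e\rangle = \langle \nabla^2 f(x) v_{\mathrm{NE}} - \nabla f(x),\, v_{\mathrm{NSK}} - v_{\mathrm{NE}}\rangle \ge 0. \tag{51}$$

Consequently, by adding and subtracting $\langle \nabla^2 f(x) v_{\mathrm{NE}}, e\rangle$, we find that

$$\tfrac{1}{2}\|S\nabla^2 f(x)^{1/2} e\|_2^2 \le \Big|\big\langle \nabla^2 f(x)^{1/2} v_{\mathrm{NE}},\, (I_n - S^T S)\,\nabla^2 f(x)^{1/2} e\big\rangle\Big|. \tag{52}$$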


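By definition, the error vector $e$ belongs to the cone $\mathcal{K}$ and the vector $\nabla^2 f(x)^{1/2} v_{\mathrm{NE}}$ is fixed
and independent of the sketch. Consequently, invoking definitions (49a) and (49b) of the
random variables $Z_1$ and $Z_2$ yields

$$\tfrac{1}{2}\|S\nabla^2 f(x)^{1/2} e\|_2^2 \ge \frac{Z_1}{2}\,\|\nabla^2 f(x)^{1/2} e\|_2^2,$$

$$\Big|\big\langle \nabla^2 f(x)^{1/2} v_{\mathrm{NE}},\, (I_n - S^T S)\,\nabla^2 f(x)^{1/2} e\big\rangle\Big| \le Z_2\, \|\nabla^2 f(x)^{1/2} v_{\mathrm{NE}}\|_2\, \|\nabla^2 f(x)^{1/2} e\|_2.$$

Putting together the pieces, we find that

$$\big\|\nabla^2 f(x)^{1/2}(v_{\mathrm{NSK}} - v_{\mathrm{NE}})\big\|_2 \le \frac{2 Z_2(S, x)}{Z_1(S, x)}\, \big\|\nabla^2 f(x)^{1/2} v_{\mathrm{NE}}\big\|_2. \tag{53}$$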


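Finally, for any $\delta \in (0, 1)$, let us define the event $\mathcal{E}(\delta) = \{Z_1 \ge 1 - \delta, \text{ and } Z_2 \le \delta\}$.
By Lemma 4 and Lemma 5 from our previous paper [20], we are guaranteed that $\mathbb{P}[\mathcal{E}(\delta)] \ge
1 - c_1 e^{-c_2 m \delta^2}$. Conditioned on the event $\mathcal{E}(\delta)$, the bound (53) implies that

$$\big\|\nabla^2 f(x)^{1/2}(v_{\mathrm{NSK}} - v_{\mathrm{NE}})\big\|_2 \le \frac{2\delta}{1 - \delta}\, \big\|\nabla^2 f(x)^{1/2} v_{\mathrm{NE}}\big\|_2.$$

By setting $\delta = \epsilon/4$, the claim follows.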
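
The bound (53) can be checked numerically in the simplest setting. The sketch below assumes the unconstrained case, so that the cone is all of $\mathbb{R}^d$ and $Z_1$, $Z_2$ have closed forms in terms of an orthonormal basis $U$ of the range of the Hessian square root. The Gaussian sketch, the choice $u = B v_{\mathrm{NE}} / \|B v_{\mathrm{NE}}\|_2$, and the function name `check_bound_53` are illustrative assumptions.

```python
import numpy as np


def check_bound_53(B, grad, m, rng):
    """Return both sides of (53) when the constraint cone is all of R^d.

    B is an (n, d) square root of the Hessian, i.e. B.T @ B is the Hessian.
    Requires m >= d so that the sketched system is invertible.
    """
    n, _ = B.shape
    S = rng.standard_normal((m, n)) / np.sqrt(m)     # Gaussian sketch, E[S.T @ S] = I_n
    v_ne = np.linalg.solve(B.T @ B, -grad)           # exact Newton direction
    SB = S @ B
    v_nsk = np.linalg.solve(SB.T @ SB, -grad)        # sketched Newton direction

    U, _ = np.linalg.qr(B)                           # orthonormal basis of range(B)
    SU = S @ U
    z1 = np.linalg.eigvalsh(SU.T @ SU).min()         # inf of ||S v||^2 over unit v in range(B)
    u = B @ v_ne / np.linalg.norm(B @ v_ne)          # fixed unit-norm vector
    z2 = np.linalg.norm(SU.T @ (S @ u) - U.T @ u)    # sup of |<u, (S^T S - I) v>| over the same v

    lhs = np.linalg.norm(B @ (v_nsk - v_ne))
    rhs = 2.0 * z2 / z1 * np.linalg.norm(B @ v_ne)
    return lhs, rhs


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    B = rng.standard_normal((500, 8))
    print(check_bound_53(B, rng.standard_normal(8), m=100, rng=rng))
```

The bound (53) is symmetric in the sign of the gradient term, so the sign convention used to define the two directions does not affect the comparison.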
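A.2 Proof of Lemma 4

By construction, the function $g(u) = f(x + u v_{\mathrm{NSK}})$ is strictly convex and self-concordant.
Consequently, it satisfies the bound $\big|\frac{d}{du}\big(g''(u)^{-1/2}\big)\big| \le 1$, whence

$$g''(s)^{-1/2} - g''(0)^{-1/2} = \int_0^s \frac{d}{du}\big(g''(u)^{-1/2}\big)\, du \ge -s,$$

or equivalently $g''(s) \le \frac{g''(0)}{(1 - s\, g''(0)^{1/2})^2}$ for $s \in \operatorname{dom} g \cap [0, g''(0)^{-1/2})$. Integrating this inequality
twice yields the bound

$$g(u) \le g(0) + u\, g'(0) - u\, g''(0)^{1/2} - \log\big(1 - u\, g''(0)^{1/2}\big). \tag{54}$$

Since $g'(u) = \langle \nabla f(x + u v_{\mathrm{NSK}}), v_{\mathrm{NSK}}\rangle$ and $g''(u) = \langle v_{\mathrm{NSK}}, \nabla^2 f(x + u v_{\mathrm{NSK}})\, v_{\mathrm{NSK}}\rangle$, the decrement
bound (41) follows.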
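
As a sanity check on (54), the following sketch evaluates both sides along a line for the log-barrier $f(x) = -\sum_i \log(b_i - \langle a_i, x\rangle)$, a standard self-concordant function. The choice of function, the grid of step sizes, and all names are illustrative assumptions rather than part of the proof.

```python
import numpy as np


def barrier(x, A, b):
    """Log-barrier f(x) = -sum_i log(b_i - <a_i, x>)."""
    return -np.sum(np.log(b - A @ x))


def upper_bound_54(u, g0, dg0, d2g0):
    """Right-hand side of (54), valid for 0 <= u < d2g0 ** -0.5."""
    a = np.sqrt(d2g0)
    return g0 + u * dg0 - u * a - np.log(1.0 - u * a)


def max_violation(A, b, x, v, num=50):
    """Largest value of g(u) minus the bound (54) over a grid of u in [0, g''(0)^{-1/2})."""
    slack = b - A @ x
    ratios = (A @ v) / slack
    dg0 = np.sum(ratios)           # g'(0)  = <grad f(x), v>
    d2g0 = np.sum(ratios ** 2)     # g''(0) = <v, Hessian f(x) v>
    g0 = barrier(x, A, b)
    us = np.linspace(0.0, 0.99 / np.sqrt(d2g0), num)
    gaps = [barrier(x + u * v, A, b) - upper_bound_54(u, g0, dg0, d2g0) for u in us]
    return max(gaps)
```

A non-positive return value (up to rounding) indicates that the bound holds at every grid point; the grid stops short of $g''(0)^{-1/2}$, where the right-hand side diverges.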


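A.3 Proof of Lemma 5

We perform this analysis conditional on the bound (37). We begin by observing that

$$\|[\nabla^2 f(x)]^{1/2} v_{\mathrm{NSK}}\|_2 \le \|[\nabla^2 f(x)]^{1/2} v_{\mathrm{NE}}\|_2 + \|[\nabla^2 f(x)]^{1/2}(v_{\mathrm{NSK}} - v_{\mathrm{NE}})\|_2 = \lambda_f(x) + \|[\nabla^2 f(x)]^{1/2}(v_{\mathrm{NSK}} - v_{\mathrm{NE}})\|_2. \tag{55}$$

The bound (37) implies that $\|[\nabla^2 f(x)]^{1/2}(v_{\mathrm{NSK}} - v_{\mathrm{NE}})\|_2 \le \epsilon\, \|[\nabla^2 f(x)]^{1/2} v_{\mathrm{NE}}\|_2 = \epsilon \lambda_f(x)$. In
conjunction with the bound (55), we see that

$$\|[\nabla^2 f(x)]^{1/2} v_{\mathrm{NSK}}\|_2 \le (1 + \epsilon)\, \lambda_f(x). \tag{56}$$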


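Our next step is to bound the term $\langle \nabla f(x), v_{\mathrm{NSK}}\rangle$: in particular, by adding and
subtracting a factor of the original Newton step $v_{\mathrm{NE}}$, we find that

$$\begin{aligned}
\langle \nabla f(x), v_{\mathrm{NSK}}\rangle &= \big\langle [\nabla^2 f(x)]^{-1/2}\nabla f(x),\, [\nabla^2 f(x)]^{1/2} v_{\mathrm{NSK}}\big\rangle \\
&= \big\langle [\nabla^2 f(x)]^{-1/2}\nabla f(x),\, [\nabla^2 f(x)]^{1/2} v_{\mathrm{NE}}\big\rangle + \big\langle [\nabla^2 f(x)]^{-1/2}\nabla f(x),\, [\nabla^2 f(x)]^{1/2}(v_{\mathrm{NSK}} - v_{\mathrm{NE}})\big\rangle \\
&= -\|[\nabla^2 f(x)]^{-1/2}\nabla f(x)\|_2^2 + \big\langle [\nabla^2 f(x)]^{-1/2}\nabla f(x),\, [\nabla^2 f(x)]^{1/2}(v_{\mathrm{NSK}} - v_{\mathrm{NE}})\big\rangle \\
&\le -\|[\nabla^2 f(x)]^{-1/2}\nabla f(x)\|_2^2 + \|[\nabla^2 f(x)]^{-1/2}\nabla f(x)\|_2\, \|[\nabla^2 f(x)]^{1/2}(v_{\mathrm{NSK}} - v_{\mathrm{NE}})\|_2 \\
&= -\lambda_f(x)^2 + \lambda_f(x)\, \|[\nabla^2 f(x)]^{1/2}(v_{\mathrm{NSK}} - v_{\mathrm{NE}})\|_2 \\
&\le -\lambda_f(x)^2 (1 - \epsilon),
\end{aligned} \tag{57}$$

where the final step again makes use of the bound (37). Repeating the above argument in the
reverse direction yields the lower bound $\langle \nabla f(x), v_{\mathrm{NSK}}\rangle \ge -\lambda_f(x)^2 (1 + \epsilon)$, so that we may
conclude that

$$|\tilde{\lambda}_f(x) - \lambda_f(x)| \le \epsilon\, \lambda_f(x). \tag{58}$$

Finally, squaring both sides of the inequality (56) and combining with the above bounds
gives

$$\|[\nabla^2 f(x)]^{1/2} v_{\mathrm{NSK}}\|_2^2 \le -\frac{(1+\epsilon)^2}{1-\epsilon}\, \langle \nabla f(x), v_{\mathrm{NSK}}\rangle = \frac{(1+\epsilon)^2}{1-\epsilon}\, \tilde{\lambda}_f^2(x) \le \Big(\frac{1+\epsilon}{1-\epsilon}\Big)^2 \tilde{\lambda}_f^2(x),$$

as claimed.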
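
The inequalities (56) and (58) can be illustrated numerically in the unconstrained case. In the sketch below, $\epsilon$ is taken to be the realized value of the ratio in (37) for a given sketch, and the sketched decrement is computed as $\tilde{\lambda}_f(x) = \sqrt{-\langle \nabla f(x), v_{\mathrm{NSK}}\rangle}$, which is the reading implied by the last display above. Both choices, along with the function name, are assumptions made for illustration.

```python
import numpy as np


def decrements(B, grad, S):
    """Exact decrement, sketched decrement, realized error ratio and ||B v_nsk||_2.

    B is an (n, d) Hessian square root (B.T @ B is the Hessian), S an (m, n) sketch.
    """
    v_ne = np.linalg.solve(B.T @ B, -grad)
    SB = S @ B
    v_nsk = np.linalg.solve(SB.T @ SB, -grad)
    lam = np.linalg.norm(B @ v_ne)                    # lambda_f(x) = ||H^{1/2} v_NE||_2
    lam_tilde = np.sqrt(-grad @ v_nsk)                # assumed: lambda~_f(x)^2 = -<grad f(x), v_NSK>
    eps = np.linalg.norm(B @ (v_nsk - v_ne)) / lam    # realized value of the ratio in (37)
    return lam, lam_tilde, eps, np.linalg.norm(B @ v_nsk)


if __name__ == "__main__":
    rng = np.random.default_rng(2)
    B = rng.standard_normal((400, 6))
    S = rng.standard_normal((80, 400)) / np.sqrt(80)
    print(decrements(B, rng.standard_normal(6), S))
```

With the realized $\epsilon$, both $\|B v_{\mathrm{NSK}}\|_2 \le (1+\epsilon)\lambda_f(x)$ and $|\tilde{\lambda}_f(x) - \lambda_f(x)| \le \epsilon\lambda_f(x)$ hold deterministically, which is what the argument above shows.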
A.4 Proof of Lemma 6

We have already proved the bound (44b) during our proof of Lemma 5; in particular, see
equation (58). Accordingly, it remains only to prove the inequality (44a).

Introducing the shorthand $\alpha := (1 + \epsilon)\lambda_f(x)$, we first claim that the Hessian satisfies the
sandwich relation

$$(1 - s\alpha)^2\, \nabla^2 f(x) \preceq \nabla^2 f(x + s v_{\mathrm{NSK}}) \preceq \frac{1}{(1 - s\alpha)^2}\, \nabla^2 f(x), \tag{59}$$

for $|1 - s\alpha| < 1$, with probability at least $1 - c_1 e^{-c_2 m \epsilon^2}$. Let us recall
Theorem 4.1.6 of Nesterov: it guarantees that

$$(1 - s\|v_{\mathrm{NSK}}\|_x)^2\, \nabla^2 f(x) \preceq \nabla^2 f(x + s v_{\mathrm{NSK}}) \preceq \frac{1}{(1 - s\|v_{\mathrm{NSK}}\|_x)^2}\, \nabla^2 f(x). \tag{60}$$

Now recall the bound (37); combining it with an application of the triangle
inequality (in terms of the semi-norm $\|v\|_x = \|\nabla^2 f(x)^{1/2} v\|_2$) yields

$$\|\nabla^2 f(x)^{1/2} v_{\mathrm{NSK}}\|_2 \le (1 + \epsilon)\, \|\nabla^2 f(x)^{1/2} v_{\mathrm{NE}}\|_2 = (1 + \epsilon)\, \|v_{\mathrm{NE}}\|_x,$$

with probability at least $1 - c_1 e^{-c_2 m \epsilon^2}$, and substituting this inequality into the bound (60)
yields the sandwich relation (59) for the Hessian.


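Using this sandwich relation (59), the Newton decrement can be bounded as

$$\begin{aligned}
\lambda_f(x_{\mathrm{NSK}}) &= \|\nabla^2 f(x_{\mathrm{NSK}})^{-1/2}\, \nabla f(x_{\mathrm{NSK}})\|_2 \\
&\le \frac{1}{1 - (1+\epsilon)\lambda_f(x)}\, \|\nabla^2 f(x)^{-1/2}\, \nabla f(x_{\mathrm{NSK}})\|_2 \\
&= \frac{1}{1 - (1+\epsilon)\lambda_f(x)}\, \Big\|\nabla^2 f(x)^{-1/2} \Big(\nabla f(x) + \int_0^1 \nabla^2 f(x + s v_{\mathrm{NSK}})\, v_{\mathrm{NSK}}\, ds\Big)\Big\|_2 \\
&= \frac{1}{1 - (1+\epsilon)\lambda_f(x)}\, \Big\|\nabla^2 f(x)^{-1/2} \Big(\nabla f(x) + \int_0^1 \nabla^2 f(x + s v_{\mathrm{NSK}})\, v_{\mathrm{NE}}\, ds + \Delta\Big)\Big\|_2,
\end{aligned}$$

where we have defined $\Delta = \int_0^1 \nabla^2 f(x + s v_{\mathrm{NSK}})\,(v_{\mathrm{NSK}} - v_{\mathrm{NE}})\, ds$. By the triangle inequality, we
can write $\lambda_f(x_{\mathrm{NSK}}) \le \frac{1}{1 - (1+\epsilon)\lambda_f(x)}\,(M_1 + M_2)$, where

$$M_1 := \Big\|\nabla^2 f(x)^{-1/2} \Big(\nabla f(x) + \int_0^1 \nabla^2 f(x + s v_{\mathrm{NSK}})\, v_{\mathrm{NE}}\, ds\Big)\Big\|_2, \quad \text{and} \quad M_2 := \|\nabla^2 f(x)^{-1/2} \Delta\|_2.$$

In order to complete the proof, it suffices to show that

$$M_1 \le \frac{(1+\epsilon)\lambda_f(x)^2}{1 - (1+\epsilon)\lambda_f(x)}, \quad \text{and} \quad M_2 \le \frac{\epsilon \lambda_f(x)}{1 - (1+\epsilon)\lambda_f(x)}.$$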
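Bound on $M_1$: Re-arranging and then invoking the Hessian sandwich relation (59) yields

$$\begin{aligned}
M_1 &= \Big\|\int_0^1 \Big(\nabla^2 f(x)^{-1/2}\, \nabla^2 f(x + s v_{\mathrm{NSK}})\, \nabla^2 f(x)^{-1/2} - I\Big)\, ds\; \nabla^2 f(x)^{1/2} v_{\mathrm{NE}}\Big\|_2 \\
&\le \Big|\int_0^1 \Big(\frac{1}{(1 - s(1+\epsilon)\lambda_f(x))^2} - 1\Big)\, ds\Big|\; \|\nabla^2 f(x)^{1/2} v_{\mathrm{NE}}\|_2 \\
&= \frac{(1+\epsilon)\lambda_f(x)}{1 - (1+\epsilon)\lambda_f(x)}\, \|\nabla^2 f(x)^{1/2} v_{\mathrm{NE}}\|_2 \\
&= \frac{(1+\epsilon)\lambda_f^2(x)}{1 - (1+\epsilon)\lambda_f(x)}.
\end{aligned}$$
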

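Bound on $M_2$: We have

$$\begin{aligned}
M_2 &= \Big\|\int_0^1 \nabla^2 f(x)^{-1/2}\, \nabla^2 f(x + s v_{\mathrm{NSK}})\, \nabla^2 f(x)^{-1/2}\, ds\; \nabla^2 f(x)^{1/2}(v_{\mathrm{NSK}} - v_{\mathrm{NE}})\Big\|_2 \\
&\le \Big|\int_0^1 \frac{1}{(1 - s(1+\epsilon)\lambda_f(x))^2}\, ds\Big|\; \|\nabla^2 f(x)^{1/2}(v_{\mathrm{NSK}} - v_{\mathrm{NE}})\|_2 \\
&= \frac{1}{1 - (1+\epsilon)\lambda_f(x)}\, \|\nabla^2 f(x)^{1/2}(v_{\mathrm{NSK}} - v_{\mathrm{NE}})\|_2 \\
&\overset{(i)}{\le} \frac{1}{1 - (1+\epsilon)\lambda_f(x)}\; \epsilon\, \|\nabla^2 f(x)^{1/2} v_{\mathrm{NE}}\|_2 \\
&= \frac{\epsilon \lambda_f(x)}{1 - (1+\epsilon)\lambda_f(x)},
\end{aligned}$$

where the inequality in step (i) follows from the bound (37).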
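
The two scalar integrals used in the bounds on $M_1$ and $M_2$ have the closed forms $\alpha/(1-\alpha)$ and $1/(1-\alpha)$, where $\alpha = (1+\epsilon)\lambda_f(x) < 1$. The short sketch below compares them with a midpoint-rule approximation; the helper names and the number of quadrature points are arbitrary illustrative choices.

```python
import numpy as np


def midpoint_integral(fn, num=200_000):
    """Midpoint-rule approximation of the integral of fn over [0, 1]."""
    s = (np.arange(num) + 0.5) / num
    return float(np.mean(fn(s)))


def m1_integral(alpha):
    """Integral over [0, 1] of 1 / (1 - s * alpha)^2 - 1, used in the bound on M_1."""
    return midpoint_integral(lambda s: 1.0 / (1.0 - s * alpha) ** 2 - 1.0)


def m2_integral(alpha):
    """Integral over [0, 1] of 1 / (1 - s * alpha)^2, used in the bound on M_2."""
    return midpoint_integral(lambda s: 1.0 / (1.0 - s * alpha) ** 2)


if __name__ == "__main__":
    for alpha in (0.1, 0.3, 0.5):
        print(alpha, m1_integral(alpha), alpha / (1 - alpha),
              m2_integral(alpha), 1 / (1 - alpha))
```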

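B Proof of Lemma 7

The proof follows the basic inequality argument given in Appendix A.1. Since $v_{\mathrm{NSK}}$ and
$v_{\mathrm{NE}}$ are optimal and feasible (respectively) for the sketched Newton problem (47b), we have
$\Psi(v_{\mathrm{NSK}}; S) \le \Psi(v_{\mathrm{NE}}; S)$. Defining the difference vector $e := v_{\mathrm{NSK}} - v_{\mathrm{NE}}$, some algebra leads to
the basic inequality

$$\tfrac{1}{2}\|S\nabla^2 f(x)^{1/2} e\|_2^2 + \tfrac{1}{2}\langle e, \nabla^2 g(x)\, e\rangle \le -\big\langle \nabla^2 f(x)^{1/2} v_{\mathrm{NE}},\, S^T S\, \nabla^2 f(x)^{1/2} e\big\rangle - \big\langle e,\, \nabla f(x) + \nabla^2 g(x)\, v_{\mathrm{NE}}\big\rangle.$$

On the other hand, since $v_{\mathrm{NE}}$ and $v_{\mathrm{NSK}}$ are optimal and feasible (respectively) for the Newton
step (47a), we have

$$\big\langle \nabla^2 f(x)\, v_{\mathrm{NE}} + \nabla^2 g(x)\, v_{\mathrm{NE}} + \nabla f(x),\, e\big\rangle \ge 0.$$

Consequently, by adding and subtracting $\langle \nabla^2 f(x)\, v_{\mathrm{NE}}, e\rangle$, we find that

$$\tfrac{1}{2}\|S\nabla^2 f(x)^{1/2} e\|_2^2 + \tfrac{1}{2}\langle e, \nabla^2 g(x)\, e\rangle \le \Big|\big\langle \nabla^2 f(x)^{1/2} v_{\mathrm{NE}},\, (I_n - S^T S)\, \nabla^2 f(x)^{1/2} e\big\rangle\Big|.$$

By convexity of $g$, we have $\nabla^2 g(x) \succeq 0$, whence

$$\tfrac{1}{2}\|S\nabla^2 f(x)^{1/2} e\|_2^2 \le \Big|\big\langle \nabla^2 f(x)^{1/2} v_{\mathrm{NE}},\, (I_n - S^T S)\, \nabla^2 f(x)^{1/2} e\big\rangle\Big|.$$

Given this inequality, the remainder of the proof follows as in Appendix A.1.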
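C Gaussian widths with $\ell_1$-constraints

In this appendix, we state and prove an elementary lemma that bounds the Gaussian
width for a broad class of $\ell_1$-constrained problems. In particular, given a twice-differentiable
convex function $\psi$, a vector $c \in \mathbb{R}^d$, a radius $R$ and a collection of $d$-vectors $\{a_i\}_{i=1}^n$, consider
a convex program of the form

$$\min_{x \in \mathcal{C}} \Big\{ \sum_{i=1}^n \psi(\langle a_i, x\rangle, y_i) + \langle c, x\rangle \Big\}, \quad \text{where } \mathcal{C} = \{x \in \mathbb{R}^d \mid \|x\|_1 \le R\}. \tag{61}$$

Lemma 8. Suppose that the $\ell_1$-constrained program (61) has a unique optimal solution $x^*$
such that $\|x^*\|_0 \le s$ for some integer $s$. Then, denoting the tangent cone at $x^*$ by $\mathcal{K}$, we have

$$\max_{x \in \mathcal{C}} \mathcal{W}(\nabla^2 f(x)^{1/2}\mathcal{K}) \le 6\sqrt{s \log d}\, \sqrt{\frac{\psi''_{\max}}{\psi''_{\min}}}\; \frac{\max_{j=1,\ldots,d} \|A_j\|_2}{\sqrt{\gamma_s^-(A)}},$$

where

$$\psi''_{\min} = \min_{x \in \mathcal{C}} \min_{i=1,\ldots,n} \psi''(\langle a_i, x\rangle, y_i), \quad \text{and} \quad \psi''_{\max} = \max_{x \in \mathcal{C}} \max_{i=1,\ldots,n} \psi''(\langle a_i, x\rangle, y_i).$$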
xGC xGC 


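Proof. It is well known [16, 20] that the tangent cone of the $\ell_1$-norm at any $s$-sparse solution
is a subset of the cone $\{z \in \mathbb{R}^d \mid \|z\|_1 \le 2\sqrt{s}\,\|z\|_2\}$. Using this fact, we have the following
sequence of upper bounds:

$$\begin{aligned}
\mathcal{W}(\nabla^2 f(x)^{1/2}\mathcal{K}) &= \mathbb{E}_w \max_{\substack{z^T \nabla^2 f(x) z = 1, \\ z \in \mathcal{K}}} \langle w, \nabla^2 f(x)^{1/2} z\rangle \\
&= \mathbb{E}_w \max_{\substack{z^T A^T \mathrm{diag}(\psi''(\langle a_i, x\rangle, y_i)) A z = 1, \\ z \in \mathcal{K}}} \big\langle w, \mathrm{diag}(\psi''(\langle a_i, x\rangle, y_i))^{1/2} A z\big\rangle \\
&\le \mathbb{E}_w \max_{\substack{z^T A^T A z \le 1/\psi''_{\min}, \\ z \in \mathcal{K}}} \big\langle w, \mathrm{diag}(\psi''(\langle a_i, x\rangle, y_i))^{1/2} A z\big\rangle \\
&\le \mathbb{E}_w \max_{\|z\|_1 \le \frac{2\sqrt{s}}{\sqrt{\gamma_s^-(A)}} \frac{1}{\sqrt{\psi''_{\min}}}} \big\langle w, \mathrm{diag}(\psi''(\langle a_i, x\rangle, y_i))^{1/2} A z\big\rangle \\
&= \frac{2\sqrt{s}}{\sqrt{\gamma_s^-(A)}} \frac{1}{\sqrt{\psi''_{\min}}}\; \mathbb{E}_w \big\|A^T \mathrm{diag}(\psi''(\langle a_i, x\rangle, y_i))^{1/2} w\big\|_\infty \\
&= \frac{2\sqrt{s}}{\sqrt{\gamma_s^-(A)}} \frac{1}{\sqrt{\psi''_{\min}}}\; \mathbb{E}_w \max_{j=1,\ldots,d} \Big|\underbrace{\sum_{i=1}^n w_i A_{ij}\, \psi''(\langle a_i, x\rangle, y_i)^{1/2}}_{Q_j}\Big|.
\end{aligned}$$

Here the random variables $Q_j$ are zero-mean Gaussians with variance at most

$$\sum_{i=1}^n A_{ij}^2\, \psi''(\langle a_i, x\rangle, y_i) \le \psi''_{\max}\, \|A_j\|_2^2.$$

Consequently, applying standard bounds on the suprema of Gaussian variates [12], we obtain

$$\mathbb{E}_w \max_{j=1,\ldots,d} \Big|\sum_{i=1}^n w_i A_{ij}\, \psi''(\langle a_i, x\rangle, y_i)^{1/2}\Big| \le 3\sqrt{\log d}\, \sqrt{\psi''_{\max}}\, \max_{j=1,\ldots,d} \|A_j\|_2.$$

When combined with the previous inequality, the claim follows. □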
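
The final Gaussian-maximum step can be illustrated by simulation. The sketch below estimates $\mathbb{E}_w \max_j |Q_j|$ by Monte Carlo and compares it with $3\sqrt{\log d}\,\sqrt{\psi''_{\max}}\,\max_j \|A_j\|_2$. The matrix `A`, the vector `psi2` of second-derivative values $\psi''(\langle a_i, x\rangle, y_i)$, and the number of trials are hypothetical inputs chosen for illustration.

```python
import numpy as np


def expected_max_abs(A, psi2, rng, trials=2000):
    """Monte Carlo estimate of E max_j |sum_i w_i A_ij psi2_i^{1/2}| for w ~ N(0, I_n)."""
    n, _ = A.shape
    W = rng.standard_normal((trials, n))
    Q = W @ (np.sqrt(psi2)[:, None] * A)     # row t holds (Q_1, ..., Q_d) for draw t
    return float(np.abs(Q).max(axis=1).mean())


def gaussian_max_bound(A, psi2):
    """Right-hand side 3 * sqrt(log d) * sqrt(psi''_max) * max_j ||A_j||_2."""
    d = A.shape[1]
    max_col_norm = np.linalg.norm(A, axis=0).max()
    return 3.0 * np.sqrt(np.log(d)) * np.sqrt(psi2.max()) * max_col_norm


if __name__ == "__main__":
    rng = np.random.default_rng(4)
    A = rng.standard_normal((50, 30))
    psi2 = rng.uniform(0.1, 0.25, size=50)
    print(expected_max_abs(A, psi2, rng), gaussian_max_bound(A, psi2))
```

The simulated value is only an estimate of the expectation, so it should be read as a plausibility check on the constant rather than a verification of the lemma.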